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Abstract 



We present an extension of Felsenstein's algorithm to indel models denned on 
entire sequences, without the need to condition on one multiple alignment. The 
algorithm makes use of a generalization from probabilistic substitution matrices 
to weighted finite-state transducers. Our approach may equivalently be viewed 
as a probabilistic formulation of progressive multiple sequence alignment, us- 
ing partial-order graphs to represent ensemble profiles of ancestral sequences. 
We present a hierarchical stochastic approximation technique which makes this 
algorithm tractable for alignment analyses of reasonable size. 

Keywords: Multiple alignment, Felsenstein pruning, transducers, insertion 
deletion, ancestral sequence reconstruction. 
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1 Background 

Felsenstein's pruning algorithm is routinely used throughout bioinformatics and 
molecular evolution pQ . A few common applications include estimation of sub- 
stitution rates [5] ; reconstruction of phylogenetic trees [3J ; identification of con- 
served (slow-evolving) or recently-adapted (fast-evolving) elements in proteins 
and DNA [4]; detection of different substitution matrix "signatures" (e.g. puri- 
fying vs diversifying selection at synonymous codon positions [5], hydrophobic 
vs hydrophilic amino acid signatures [5], CpG methylation in genomes [7J, or 
basepair covariation in RNA structures (5]); annotation of structures in genomes 
[TO] ; and placement of metagenomic reads on phylogenetic trees [TT] . 

The pruning algorithm computes the likelihood of observing a single column 
of a multiple sequence alignment, given knowledge of an underlying phyloge- 
netic tree (including a map from leaf-nodes of the tree to rows in the alignment) 
and a substitution probability matrix associated with each branch of the tree. 
Crucially, the algorithm sums over all unobserved substitution histories on in- 
ternal branches of the tree. For a tree containing N taxa, the algorithm achieves 
O(N) time and memory complexity by computing and tabulating intermediate 
probability functions of the form G n (x) = P(Y n \x n = x), where x n represents 
the individual residue state of ancestral node n, and Y n = {y m } represents all 
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the data at leaves {in} descended from node n in the tree (i.e. the observed 
residues at all leaf nodes m whose ancestors include node n). 

The pruning recursion visits all nodes in postorder. Each G n function is 
computed in terms of the functions G; and G r of its immediate left and right 
children (assuming a binary tree): 

G n {x) = P(Y n \x n =x) 

[ (E Sl B$ziGltoj) (l2 Xr B x% r G r (x r )j if n is not a leaf 
I 8{x = y n ) if n is a leaf 

where = P(x n = b\x m = a) is the probability that node n has state 6, 

given that its parent node m has state a; and S(x — y n ) is a Kronecker delta 
function terminating the recursion at the leaf nodes of the tree. 

The "states" in the above description may represent individual residues (nu- 
cleotides, amino acids), base-pairs (in RNA secondary structures) or base-triples 
(codons). Sometimes, the state space is augmented to include gap characters, or 
latent variables. In the machine learning literature, the G n functions are often 
described as "messages" propagated from the leaves to the root of the tree [T^] , 
and corresponding to a summary of the information in the subtree rooted at n. 

The usual method for extending this approach from individual residues to 
full-length sequences assumes both that one knows the alignment of the se- 
quences, and that the columns of this alignment are each independent realiza- 
tions of single-residue evolution. One uses pruning to compute the above like- 
lihood for a single alignment column, then multiplies together the probabilities 
across every column in the alignment. For an alignment of length L, the time 
complexity is O(LN) and the memory complexity O(N). This approach works 
well for marginalizing substitution histories consistent with a single alignment, 
but does not readily generalize to summation over indcl histories or alignments. 

The purpose of this manuscript is to introduce another way of extending 
Felsenstein's recursion from single residues (or small groups of residues) to en- 
tire, full-length sequences, without needing to condition on a single alignment. 
With no constraints on the algorithm, the time and memory complexities are 
0(L N ), with close similarity to the algorithms of Sankoff [T3j and Hein [T^] . 
This drops to 0(L 2 N) when a stochastic lower-bound approximation is used; 
in this form, the algorithm is similar to the partial order graph for multiple 
sequence alignment [TS] ■ The complexity drops further to O(LN) if a "guide 
alignment" is provided as a clue to the algorithm. The guide alignment is not a 
hard constraint, and may be controllably relaxed, or dispensed with altogether. 

The new algorithm is, essentially, algebraically equivalent to Felsenstein's 
algorithm, if the concept of a "substitution matrix" over a particular alphabet 
is extended to the countably-infinite set of all sequences over that alphabet. 
Our chosen class of "infinite substitution matrix" is one that has a finite rep- 
resentation: namely, the finite-state transducer, a probabilistic automaton that 
transforms an input sequence to an output sequence, and a familiar tool of 
statistical linguistics [16]. 
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In vector form, Felsenstein's pruning recursion is 



q _ ( (B (;) G() o (B( r )G r ) if n is not a leaf 
" 1 V(y ra ) if n is a leaf 

where AoB is the pointwise (Hadamard) product and V(x) is the unit column 
vector in dimension x. By generalizing a few key algebraic ideas from matrices 
to transducers (matrix multiplication, the pointwise product, row vectors and 
column vectors), we are able to interpret this vector-space form of Felsenstein's 
algorithm as the specification of a composite phylogenetic transducer that spans 



all possible alignments (see Subsection 3.11.5) 



The transducer approach offers a natural generalization of Felsenstein's prun- 
ing recursion to indels, since it can be used to calculate 



p(s\t,9) = J2p(s,a\t, i 



i.e. the likelihood of sequences S given tree T and parameters 9, summed over 
all alignments A. Previous attempts to address indels phylogenetically have 
mostly returned P(S\A,T, 9) where A represents a single alignment (typically 
estimated by a separate alignment program, which may introduce undetermined 
biases). The exceptions to this rule are the "statistical alignment" methods 
[T7J [HJ [THl [l9l [20] which also marginalize alignments in an unbiased way — albeit 
more slowly, since they use Markov Chain Monte Carlo methods (MCMC). 
In this sense, the new algorithm may be thought of as a fast, non-MCMC 
approximation to statistical alignment. 

The purpose of this manuscript is a clean theoretical presentation of the 
algorithm. In separate work (manuscript submitted) we find that the algorithm 
appears to recover more accurate reconstructions of simulated phylogenetic indel 
histories, as indicated by proxy statistics such as the estimated indel rate. 

The use of transducers in bioinformatics has been reported before [5TJ 1221 
l2"3l including an application to genome reconstruction that is conceptually 
similar to what we do here for proteins (24) . However, this paper represents a 
clearer presentation of the underlying mathematics. In particular, to maximize 
accessibility, we have chosen to use a formulation of finite-state transducers that 
closely mirrors the formulation available on Wikipedia at the time of writing 
(http : //en . wikipedia . org/wiki/Finite_state_transducer|). 



1.1 Document structure 

We will begin with a narrative, "tutorial" overview that introduces the main the- 
oretical concepts using a small worked example. Following this we will present 
general, precise, technical definitions. The informal overview makes extensive 
use of illustrative figures, and is intended to be easily read, even by someone 
not familiar with transducer theory. The technical description is intended pri- 
marily for those wishing to integrate the internals of this method into their own 
algorithms. Either section may be read in isolation. 
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Each example presented in this informal section (e.g. single transducers, 
composite transducers, sampled paths) correspond to rigorously-defined math- 
ematical constructs defined in the technical section. Whenever possible, we 
provide references between the examples and their technical definitions. 

The final section of the tutorial, Subsection |2.8[ gives a detailed account of 
the connections between the tutorial and formal sections. 



2 Informal tutorial on transducer composition 

In this section we introduce (via verbal descriptions and graphical representa- 
tions) the various machines and manners of combining them necessary for the 
task of modeling evolution on the tree shown in Figure [I] (The arrangement 
of machines required for this tree is shown in Figure [2j) While the conceptual 
underpinnings of our algorithm are not unusually complex, a complete mathe- 
matical description demands a significant amount of technical notation (which 
we provide in Subsection [3]). For this reason, we aim to minimize notation in 
this section, instead focusing on a selection of illustrative example machines 
ranging from simple to complex. 

We first describe the sorts of state machines used, beginning with simple 
linear machines (which appear at the leaves of the tree in Figure [2]) and moving 
on to the various possibities of the branch model. Then we describe (and provide 
examples for) the techniques which allow us to co-ordinate these machines on a 
phylogeny: composition and intersection. 

Finally, we outline how combinations of these machines allow a straight- 
forward definition of Felsenstein's pruning algorithm for models allowing inser- 
tion/deletion events, and a stochastic approximation technique which will allow 
inference on datasets of common practical size. 



2.1 Transducers as input-output machines 

We begin with a brief definition of transducers from Wikipedia. These ideas are 



defined with greater mathematical precision in Subsection 3.1 

A finite state transducer is a finite state machine with two tapes: an input 
tape and an output tape. ... An automaton can be said to recognize a string if 
we view the content of its tape as input. In other words, the automaton com- 
putes a function that maps strings into the set {0, 1}. (f f f) Alternatively, we 
can say that an automaton generates strings, which means viewing its tape 
as an output tape. On this view, the automaton generates a formal language, 
which is a set of strings. The two views of automata are equivalent: the func- 
tion that the automaton computes is precisely the indicator function of the set 
of strings it generates. . . Finite State Transducers can be weighted, where each 
transition is labeled with a weight in addition to the input and output labels. 



http : //en . wikipedia . org/wiki/Finite_state_transducer 



(f f f ) For a weighted transducer this mapping is, more generally, to the non- 
negative real axis [0, oo) rather than just the binary set {0, 1}. 
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Figure 1: Example tree used in this tutorial. The TKF91 model is used as 
the branch transducer model, but our approach is applicable to a wider range 
of string transducer models. 



In this tutorial section we are going to work through a small examples of 
using transducers on a tree for three tiny protein sequences (MF, CS, LIV). 
Specifically, we will compute the likelihood of the tree shown in Figure [l] ex- 
plaining the common descent of these three sequences under the so-called TKF91 
model (Figure 16), as well as a simpler model that only allows point substitu- 
tions. To do this we will construct (progressively, from the bottom up) the 
ensemble of transducer machines shown in Figure [2l We will see that the full 
state space of Figure is equivalent to Hein's 0(Lt) alignment algorithm for 
the TKF91 model |14]7 Q(NL 2 ) progressive alignment corresponds to a greedy 
Viterbi/maximum likelihood-traceback approximation, and partial-order graph 
alignment corresponds to a Forward/stochastic-traceback approximation. 
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TKF91 model, B 




CS-recognizer, V(C5) 




Figure 2: An ensemble of transducers modeling the likelihood of the tree shown 
in Figure [lj We write this as U ■ (B ■ {B ■ V(LIV)) o (B ■ V(MF))) o (B ■ V(CS)). 
The terms in this expression represent individual component transducers: 1Z is 



shown in Figure 18 B is shown in Figure |17| V(LIV) is in Figure |10| V(MF) in 
Figure 11 and V(C5) in Figure 12 (The general notation V(. 



is introduced 

in Subsection |2. 2.1 and formalized in Subsection |3.7| ) The operations for com- 
bining these transducers, denoted "•" and "o" , are — respectively — transducer 



composition (introduced in Subsection 2.3 formalized in Subsection 3.4 1 and 



transducer intersection (introduced in Subsection 2.5 formalized in Subsec- 
tion 3.5 ). The full state graph of this transducer is not shown in this manuscript: 



even for such a small tree and short sequences, it is too complex to visualize eas- 
ily (the closest thing is Figure |42j which represents this transducer configuration 
minus the root generator, TV). 
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Figure 3: Generator for protein sequence MF. This is a trivial state machine 
which emits (generates) the sequence MF with weight (probability) 1. 




Figure 4: Recognizer for protein sequence LIV. This is a trivial state machine 
which absorbs (recognizes) the sequence LIV with weight (probability) 1, and 
all other sequences with weight 0. 

2.1.1 Generators and recognizers 

As noted in the Wikipedia quote, transducers can be thought of as general- 
izations of the related concepts of generators (state machines that emit out- 
put sequences, such as HMMs) and parsers or recognizers (state machines that 
match/parse input sequences, such as the UNIX 'lex' program). Both gener- 
ators and recognizers are separate special cases of transducers. Of particular 
use in our treatment are generators/recognizers that generate/recognize a single 
unique sequence. Generators and recognizers are defined with greater precision 
in Subsection 13.81 and Subsection 13.71 

Figure [3] is an example of a generator that uniquely generates (inserts) the 
protein sequence MF. Figure [4] is an example of a recognizer that uniquely 
recognizes (and deletes) the protein sequence LIV. 

These Figures illustrate the visual notation we use throughout the illustrative 
Figures of this tutorial. States and transitions are shown as a graph. Transi- 
tions can be labeled with absorption/emission pairs, written x/y where x is the 
absorbed character and y the emitted character. Either x or y is allowed to be 
the empty string (shown in these diagrams as the gap character, a hyphen). In a 
Figure that shows absorption/emission pairs, if there is no absorption/emission 
labeled on a transition, then it can be assumed to be — /— (i.e. no character is 
absorbed or emitted) and the transition is said to be a "null" transition. 

Some transitions are also labeled with weights. If no transition label is 
present, the weight is usually 1 (some more complicated diagrams omit all the 
weights, to avoid clutter). The weight of a path is defined to be the product of 
transition weights occuring on the path. 

The weight of an input-output sequence-pair is the sum over all path weights 
that generate the specified input and output sequences. The weight of this 
sequence-pair can be interpreted as the joint probability of both (a) successfully 
parsing the input sequence and (b) emitting the specified output sequence. 
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Note sometimes this weight is zero — e.g. in Figure[4]the weight is zero except 
in the unique case that the input tape is LIV, when the weight is one — this in 
fact makes Figure[4]a special kind of recognizer: one that only recognizes a single 
string (and recognizes that string with weight one). We call this an exact-match 
recognizer. 

More generally, suppose that G, R and T are all probabilistically weighted 
finite-state transducers: G is a generator (output only), R is a recognizer (input 
only) and T is a general transducer (input and output). Then, conventionally, 
G defines a probability distribution P(Y\G) over the emitted output sequence 
Y; R defines a probability P(accept|X, R) of accepting a given input sequence 
X; and T defines a joint probability P(accept, Y\X, T) that input X will be 
accepted and output Y emitted. According to this convention, it is reasonable 
to expect these weights to obey the following (here f2* denotes the set of all 
sequences) : 

E p ( y i G ) = 1 

Yen* 

P(accept|X,i?) < 1 VIefi* 
E ^(accept, Y\X,T) < 1 Vlefi* 

It is important to state that these are just conventional interpreta- 
tions of the computed weights: in principle the weights can mean anything 
we want, but it is common to interpret them as probabilities in this way. 

Thus, as noted in the Wikipedia quote, generators and recognizers are in 
some sense equivalent, although the probabilistic interpretations of the weights 
are slightly different. In particular, just as we can have a generative profile that 
generates some sequences with higher probability than others (e.g. a profile 
HMM) we can also have a recognition profile: a transducer that recognizes some 
sequences with higher probability than others. The exact-match transducer of 
Figure [4] is a (trivial and deterministic) example of such a recognizer; later we 
will see that the stored probability vectors in the Felsenstein pruning recursion 
can also be thought of as recognition profiles. 

2.2 Moore machines 

In our mathematical descriptions, we will treat transducers as Mealy machines, 
meaning that absorptions and emissions are associated with transitions between 
states. In the Moore machine view, absorptions and emissions are associated 
with states. In our case, the distinction between these two is primarily a se- 
mantic one, since the structure of the machines and the emission functions of 
the states is intimately tied. 

The latter view (Moore) can be more useful in bioinformatics, where rapid 
point substitution means that all combinations of input and output characters 
are possible. In such situations, the Mealy type of machine can suffer from an 
excess of transitions, complicating the presentation. For example, consider the 
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Figure 5: All combinations of input and output characters are frequently ob- 
served. In a Mealy-machine-like representation, each possible pair of characters 
must be given its own transition, complicating the visualization. 





h 




Figure 6: In a condensed Moore-machine-like representation, possible combina- 
tions of input and output characters are encoded in the distributions contained 
within each state, simplifying the display. For most Figures in the remainder of 
this manuscript, we will omit the blue "x/y" labels on transitions, as they are 
implied by the state type of the destination state. 



Mealy-machine-like view of Figure [51 and compare it with the more compact 
Moore-machine-like view of Figure [6j 

Figure [7] shows the visual notation we use in this tutorial for Moore-form 
transducer state types. There are seven state types: Start, Match, Insert, Delete, 
End, Wait, and Null, frequently abbreviated to S, M, I, D, E, W, N. State types 

Note that in the TKF91 model (Fig- 



arc defined precisely in Subsection 3.2 
ure 



16 the example we use for most of this tutorial) there are exactly one of 



each of these types, but this is not a requirement. For instance, the transducer 
in Figure [20| has two Insert, two Delete and three Wait states, while Figure [T4| 
has no Insert or Delete states at all; Figure [9] has no Match states; and so on. 
Some other features of this view include the following: 



• The shape and color of states indicates their type. The six Moore normal 
form states are all red. Insert states point upwards; Delete states point 
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Figure 7: In a Moore machine, each state falls into one of several types (a trans- 
ducer may contain more than one state of each type). A state's type determines 
its I/O capabilities: an Insert state emits (writes) an output character, a Delete 
state absorbs (reads) an input character, and a Match state both absorbs an 
input character and emits an output character. Mnemonics: Insert states point 
upwards, Delete states point Downwards, Wait states are octagonal like Stop 
signs. 



downwards; Match states are rectangles; Wait states are octagons (like 
U.S. stop signs); Start is a circle and End is a diamond. There is a seventh 
state type, Null, which is written as a black circle to distinguish it from 
the six Moore-form state types. (Null states are often nuisance states that 
must be eliminated.) This visual shorthand will be used throughout. 

• We impose certain constraints on states that involve I/O: they must be 
typed as Insert, Delete, or Match, and their type determines what kinds 
of I/O happens on transitions into those states (e.g. a Match state always 
involves an absorption and an emission). 

• We impose certain constraints on transitions into I/O states: their weights 
must be factorizable into transition and I/O components. Suppose j is a 
Match state and i is a state that precedes j; then all transitions i — >■ j 
must both absorb a non-gap input character x and emit a non-gap output 

character y, so the transition can be written i — % j and the transition 
weight must take the form tjj x ej(x, y) where fcy is a component that can 
depend on the source and destination state (but not the I/O characters) 
and ej(x,y) is a component that can depend on the I/O characters and 
the destination state (but not the source state). 

• We can then associate the "7/ O weight {unction" ej with Match state j and 
the "transition weight" tij with a single conceptual transition i j that 

summarizes all the transitions i j (compare Figure [5] and Figure 6 ) . 

• The function ej can be thought of as a conditional-probability substitution 
matrix (for Match states, c.f. Q in Figure [6]), a row vector representing a 
probability distribution (for Insert states, c.f. U in Figure (6J), or a column 
vector of conditional probabilities (for Delete states, c.f. V in Figure [6]). 

• Note that we call ej an "I/O function" rather than an "emit function". 
The latter term is more common in bioinformatics HMM theory; however, 
ej also describes probabilistic weights of absorptions as well as emissions, 
and we seek to avoid ambiguity. 
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Figure 8: Allowed transitions between types of transducer states, along with 
their I/O requirements. In particular, note the Wait state(s) which must pre- 
cede all absorbing (Match and Delete) states — the primary departure from the 
familiar pair HMM structure. Wait states are useful in co-ordinating multi- 
ple connected trandsucers, since they indicate that the transducer is "waiting" 
for an upstream transducer to emit a character before entering an absorbing 
state. Also note that this graph is not intended to depict a particular state ma- 
chine, but rather it shows the transitions which are permitted between the types 
of states of arbitrary machines under our formalism. Since the TKF91 model 
(Figure 17) contains exactly one state^f each type, its structure is similar to 



this graph (but other transducers may have more or less than one state of each 
type). 



Figure [8] shows the allowed types of transition in Moore- normal form trans- 
ducers. In our "Moore-normal form" for transducers, we require that all input 
states (Match, Delete) are immediately preceded in the transition graph by a 
Wait state. This is useful for co-ordinating multiple transducers connected to- 
gether as described in later sections, since it requires the transducer to "wait" for 
a parent transducer to emit a character before entering an absorbing state. The 
downside is that the state graph sometimes contains a few Wait states which 
appear redundant (for example, compare Figure [9] with Figure [3] or Figure 10 



with Figure [4]). For most Figures in the remainder of this manuscript, we will 
leave out the blue "x/y" labels on transitions, as they are implied by the state 
type of the destination state. 

Note also that this graph depicts the transitions between types of states 
allowed under our formalism, rather than a particular state machine. It happens 



that the TKF91 model (Figure 17) contains exactly one state of each type, so its 



state graph appears similar to Figure |8j but this is not true of all transducers. 

2.2.1 Generators and recognizers in Moore normal form 

We provide here several examples of small transducers in Moore normal form, 
including versions of the transducers in Figure [3] and Figure [4j 

We introduce a notation for generators (A) and recognizers (V); a useful 
mnemonic for this notation (and for the state types in Figure [7| is "insertions 
point up, deletions point down" . 

• Figure [9] uses our Moore-machine visual representation to depict the gen- 
erator in Figure [3] We write this transducer as A(MF). 

• Figure [10] is a Moore-form recognizer for sequence LIV. We write this 
transducer as V(LIV). The state labeled S z (for Z G {L, I, V}) has I/O 
function 5(x = Z), defined to be 1 if x — Z, otherwise. The machine 
recognizes sequence LIV with weight 1, and all other sequences with weight 
0. 

• Figure [TT] is the Moore-machine recognizer for MF, the same sequence 
whose generator is shown in Figure[9j We write this transducer as V(MF). 

• Figure [12] is the Moore-machine recognizer for sequence CS. We write this 
transducer as V(C5). 

• Figure [13] is a "null model" generator that emits a single IID sequence 
(with residue frequency distribution tt) of geometrically-distributed length 
(with geometric parameter p) . 



2.2.2 Substitution and identity 



Figure 14 shows how the Moore-normal notation can be used to represent a 
substitution matrix. The machine pauses in the Wait state before absorbing 
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Figure 9: Transducer A(MF), the Moore-normal form generator for protein 
sequence MF. The states are labeled S (Start), E (End), %m and ip (Insert 
states that emit the respective amino acid symbols), and Wp (a Wait state that 
pauses after emitting the final amino acid; this is a requirement imposed by our 
Moore normal form). The state labeled %z (for Z 6 {M,F}) has I/O function 
5{y = Z). 




Figure 10: Transducer \7(LIV), the Moore- normal form recognizer for protein 
sequence LIV. The states are labeled S (Start), E (End), 5l, Si and 8y (Delete 
states that recognize the respective amino acid symbols), Wl, Wi and Wy (Wait 
states that pause after recognizing each amino acid; these are requirements 
imposed by our Moore normal form). The states have been grouped (enclosed 
by a rectangle) to show four clusters: states that are visited before any of 
the sequence has been recognized, states that are visited after "L" has been 
recognized, states that are visited after "I" has been recognized, and states that 
are visited after "V" has been recognized. The I/O function associated with 
each Delete state Sz is S(x = Z). 




Figure 11: Transducer V(MF), the Moore-normal form recognizer for protein 
sequence MF. The states are labeled S (Start), E (End), 8m and 8p (Delete 
states that recognize the respective amino acid symbols), Wm and Wp (Wait 
states that pause after recognizing each amino acid; these are requirements 
imposed by our Moore normal form). The states have been grouped (enclosed 
by a rectangle) to show four clusters: states that are visited before any of 
the sequence has been recognized, states that are visited after "M" has been 
recognized, and states that are visited after "F" has been recognized. The I/O 
function associated with each Delete state 8z is 8(x = Z). 
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Figure 12: Transducer V(C5), the Moore-normal form recognizer for protein 
sequence CS. The states are labeled S (Start), E (End), Sc and 5s (Delete states 
that recognize the respective amino acid symbols), Wc and Ws (Wait states that 
pause after recognizing each amino acid; these are requirements imposed by our 
Moore normal form). The states have been grouped (enclosed by a rectangle) to 
show four clusters: states that are visited before any of the sequence has been 
recognized, states that are visited after "C" has been recognized, and states 
that are visited after "S" has been recognized. The I/O function associated 
with each Delete state Sz is S(x = Z). 




n{y) 



Figure 13: Transducer Af, a simple null-model generator with geometric length 
parameter p and residue frequency distribution it. 
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Figure 14: Transducer S(Q) ("the substituter" ) introduces substitutions (ac- 
cording to substitution matrix Q) but no indels. Whenever the machine makes 
a transition into the rectangular Match state, a character x is read from the in- 
put and a character y emitted to the output, with the output character sampled 
from the conditional distribution P{y\x) — Q xy . 







6(x = y) 







Figure 15: Transducer X, the identity transducer, simply copies the input tape 
to the output tape. This can be thought of as a special case of the substituter 
transducer in Figure [14] where the substitution matrix is the identity matrix. 
Formal definitions: Transducer X is defined in Subsection 13.61 



each residue x and emitting a residue y according to the distribution Q xy . Since 
there are no states of type Insert or Delete, the output sequence will necessarily 
be the same length as the input. 

This is something of a trivial example, since it is certainly not necessary to 
use transducer machinery to model point substitution processes. Our aim is to 
show explicitly how a familiar simple case (the Felsenstein algorithm for point 
substitution) is represented using the more elaborate transducer notation. 

Figure fl5] shows the identity transducer, X, a special case of the substituter: 
X = S{5), where 

f 1 x = y 
xv \ x^y 

The identity essentially copies its input to its output directly. It is defined 
formally in Subsection |3.6| 
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2.2.3 The TKF91 model as a transducer 



We use the TKF91 model of indels [25] as an example, not because it is the best 
model of indels (it has deficiencies, most notably the linear gap penalty); rather, 
because it is canonical, widely- known, and illustrative of the general properties 
of transducers. 



The TKF91 transducer with I/O functions is shown in Figure 16 The 
underlying continuous-time indel model has insertion rate A and deletion rate [i. 
The transition weights of the transducer, modeling the stochastic transformation 
of a sequence over a finite time interval t, are a = exp(— fit) is the "match 
probability" (equivalently, 1 — a is the deletion probability), b = ^ ^ZaXcxp(xt) * s 
the probability of an insertion at the start of the sequence or following a match, 
and c = x(i-a) ^ s ^ ne probability of an insertion following a deletion. These 
may be derived from analysis of the underlying birth-death process [25] . 

The three I/O functions for Match, Delete and Insert states are defined as 
follows: conditional on absorbing character x, the Match state emits y with 
probability exp(Rt) xy , where R is the rate matrix governing character substitu- 
tions and t is the time separating the input and output sequences. Characters 
are all deleted with the same weight: once a Delete state is entered the absorbed 
character is deleted with weight 1 regardless of its value. Inserted characters y 
are emitted according to the equilibrium distribution tt v . 

In the language of [25], every state path contains a Start— > Wait segment 
that corresponds to the "mortal link" , and every Wait— > Wait tour corresponds 
to an "immortal link" . 



Figure 17 is a version of Figure 16 where, rather than writing the I/O weight 



functions directly on each state (as in Figure 16), we have instead written a 



state label (as in Figure 10 Figure 11 and Figure 12). The state labels are 



S,M,I,D,E,W (interpretation: Start, Match, Insert, Delete, End, Wait). 

It has been shown that affine-gap versions of the TKF91 model can also be 
represented using state machines |26| . An approximate affine-gap transducer, 
based on the work of Knudsen [57] , Rivas [25] et al, is shown in Figure [l9] a 
version that approximates a two-component mixture-of-geometric indel length 



distributions is shown in Figure 20 while a transducer that only inserts/deletes 
in multiples of 3 is in Figure [2T] 

2.2.4 A generator for the equilibrium distribution of the TKF91 
model 

TKF91 is an input/output transducer that operates on individual branches. 
In order to arrange this model on a phylogenetic tree, we need a generating 
transducer at the root. Shown in Figure [T8| is the equilibrium TKF91 generator, 



conceptually the same as Figure 13 with insertion weight X//J, (which is the 
parameter of the geometric length distribution of sequences at equilibrium under 
the TKF91 model). 

We have now defined all the component transducers we need to model evolu- 
tion along a phylogeny. Conceptually, we can imagine the equilibrium machine 
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Figure 16: The TKF91 transducer, labeled with transition weights and I/O 
functions. This transducer models the sequence transformations of the TKF91 



model [25]. The transition weights a,b, c are defined in Subsection 2.2.3 Fig- 
ure 17 shows the same transducer with state type labels rather than I/O func- 
tions. 



generating a sequence at the root of the tree which is then repeatedly passed 
down along the branches, each of which mutates the sequence via a TKF91 
machine (potentially introducing substitutions and indels). The result is a set 
of sequences (one for each node) , whose evolutionary history is recorded via the 
actions of each TKF91 branch machine. What remains is to detail the various 
ways of connecting transducers by constraining some of their tapes to be the 
same — namely composition and intersection. 

2.3 Composition of transducers 

The first way of connecting transducers that we will consider is "composition" : 
feeding the output of one transducer, T, into the input of another, U. The 
two transducers are now connected, in a sense, since the output of T must be 
synchronized with inputs from U : when T emits a character, U must be ready 
to absorb a character. 

We can equivalently consider the two connected transducers as a single, 
composite machine that represents the connected ensemble of T followed by U . 
From two transducers T and U , we make a new transducer, written TU (or 
T ■ U), wherein every state corresponds to a pair (t,u) of T- and [/-states. 

In computing the weight of an input /output sequence pair (X,Y), where 
the input sequence X is fed to T and output Y is read from U, we must sum 
over all state paths through the composite machine TU that are consistent with 
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Figure 17: Transducer B, the TKF91 model on a branch of length t. This 
transducer models the sequence transformations of the TKF91 model [25]. The 
machine shown here is identical to that of Figure |T6| in nearly all respects. The 
only difference is this: in order that we can later refer to the states by name, 
rather than writing the I/O weight functions directly on each state we have 
instead written a state type label S, M, I, D, E, W (Start, Match, Insert, Delete, 
End, Wait). It so happens that the TKF91 transducer has one of each of these 
kinds of state. Identically to Figure [l6j the transition weights a, b, c are defined 
in Subsection 2.2. 3| For each of the I/O states (I, D and M) we must, of course, 
still specify an I/O weight function. These are also identical to Figure 16 and 
are reproduced here for reference: exp(Rt) is the substitution matrix for the 
M (Match) state, it is the vector of weights corresponding to the probability 
distribution of inserted characters for the / (Insert) state, and (1,1,..., 1) is the 
vector of weights corresponding to the conditional probabilities that any given 
character will be deleted by the D (Delete) state (in the TKF91 model, deletion 
rate is independent of the actual character being deleted, which is why these 
delete weights are all 1). 
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Figure 18: Transducer TZ, the equilibrium TKF91 generator. The equilibrium 



distribution for the TKF91 model is essentially the same generator as Figure 13 
with p = A//i. The I/O function for the / (Insert) state is ir y . Note that 
this is the limit of Figure [16] on a long branch with empty ancestor: TZ = 
lim^oo (A(e) • B(t)). 




Figure 19: An approximate affine-gap transducer. Note that in contrast to 
Figure [THJ this machine requires two Wait states, with the extra Wait state 
serving to preserve the "memory" of being inside a deletion. The parameters 
of this model are insertion rate A, deletion rate fi, gap extension probability g, 
substitution rate matrix R, inserted residue frequencies tt (typically the equilib- 
rium distribution, so nR = 0), and branch length t. The transition weights use 
the quantities a = exp(— fit) and b = 1 — cxp(— At). The transducer in Figure 18 
serves as a suitable root generator. 
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Figure 20: A transducer that models indel length distributions as a 2- 
component mixture of geometric distributions. Note that this machine requires 
two separate copies of the Insert and Delete states, along with three Wait states, 
with two of the Wait states serving to preserve the "memory" of being inside 
a deletion from the respective Delete states. The parameters of this model are 
insertion rate A, deletion rate fj,, gap extension probabilities gi and <?2 for the 
two mixture components, first component mixture strength /, substitution rate 
matrix R, inserted residue frequencies ir (typically the equilibrium distribution, 
so 7T i? = 0), and branch length t. The transition weights use the quantities 
a = exp(—fii) and b = 1 — cxp(— At). The transducer in Figure 18 serves as a 
suitable root generator. 
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Figure 21: A transducer whose indels are a multiple of 3 in length. (This is 
not in strict normal form, but could be made so by adding a Wait state after 
every Delete state.) 



[X, Y). In doing so, we are effectively summing over the intermediate sequence 
(the output of T, which is the input of U), just as we sum over the intermedi- 
ate index when doing matrix multiplication. In fact, matrix multiplication of 
I/O functions is an explicit part of the algorithm that we use to automatically 



construct composite transducers in Subsection 3.4 The notation (TU or T ■ U) 
reflects the fact that transducer composition is directly analogous to matrix 
multiplication. 

Properties of composition include that if T is a generator then TU is also a 
generator, while if U is a recognizer then TU is also a recognizer. A formal def- 
inition of transducer composition, together with an algorithm for constructing 
the composite transducer TU, is presented in Subsection |3.4| This algorithmic 
construction of TU (which serves both as a proof of the existence of TU, and 
an upper bound on its complexity) is essential as a formal means of verifying 
our later results. 

2.3.1 Multiplying two substitution models 

As a simple example of transducer composition, we turn to the simple substi- 
tuting transducers in Figure [14] and Figure [22] Composing these two in series 

models two consecutive branches x y z, with the action of each branch 
modeled by a different substitution matrix, Q and R. 



Constructing the T ■ U composite machine (shown in Figure 23 1 simply in- 
volves constructing a machine whose Match state I/O function is the matrix 
product of the component substitution matrices, QR. If x denotes the input 
symbol to the two-transducer ensemble (input into the Q-substituter), y de- 
notes the output symbol from the two-transducer ensemble (output from the 
R-substituter) , and z denotes the unobserved intermediate symbol (output from 
Q and input to R), then the I/O weight function for the composite state QR in 
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Figure 22: Transducer S(R) introduces substitutions (rate matrix R) but no 
indels. Compare to Figure [l4j which is the same model but with substitution 
matrix Q instead of R. 










(QR)xy 







Figure 23: Transducer S(Q)-S(R), the composition of Figure 14 and Figure 22 
Note that this is just S(QR): in the simple case of substituters, transducer 
composition is exactly the same as multiplication of substitution matrices. In 
the more general case, transducer composition is always analogous to matrix 
multiplication, though (unlike matrices) transducers tend in general to require 
more representational storage space when multiplied together (not the case here, 
but see e.g. Figure 24). 
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Figure 24: Transducer B ■ B, the composition of the TKF91 model (Figure 171 
with itself, representing the evolutionary drift over two consecutive branches of 
the phylogenetic tree. Each composite state is labeled with the states of the 
two component transducers. I/O weight functions, state labels, and meanings 
of transition parameters are explained in Table [T] and Subsection |2.3.2| For- 
mal definitions: The algorithmic construction of the composite transducer is 
described in Subsection 13.41 



the two-transducer ensemble is 

z 

2.3.2 Multiplying two TKF91 models 

For a more complex example, consider the composition of a TKF91 transducer 
with itself. This is again like two consecutive branches x — > y — > z, but now 
TKF91 is acting along each branch, rather than a simple substitution process. 
An input sequence is fed into first TKF91; the output of this first TKF91 
transducer is an intermediate sequence that is fed into the input of the second 
TKF91; output of this second TKF91 is the output of the entire ensemble. 
The composite machine inputs a sequence and outputs a sequence, just like 



2G 



the machine in Figure 23 but these sequences may differ by insertions and 
deletions as well as substitutions. The intermediate sequence (emitted by the 
first TKF91 and absorbed by the second) is unobserved, and whenever we sum 
over state paths through the composite machine, we are essentially summing 
over values for this intermediate sequence. 

Figure [24] shows the state space of this transducer. The meaning of the 
various states in this model are shown in Table [TJ 

The last two states in Table [l] 77 and MI, appear to contradict the co- 
ordination of the machines. Consider, specifically, the state MI: the first trans- 
ducer is in a state that produces output (M) while the second transducer is in 
a state which does not receive input (I). The solution to this paradox is that 
the first transducer has not emitted a symbol, despite being in an M state, 
because symbols are only emitted on transitions into M states; and inspection 



of the composite machine (Figure 24 1 reveals that transitions into state MI 
only occur from states MD or MM. Only the second transducer changes state 
during these transitions; the M state of the first transducer is a holdover from 
a previous emission. The interpretation of such transitions is well-specified by 



our formal construction (Subsection 3.4 1 



In a specific sense (summing over paths) this composite transducer is equiva- 
lent to the single transducer in Figure [T6| but with the time parameter doubled 
(t — > 2t). We can write this statement as B(t) ■ B(t) = B(2t). This state- 
ment is, in fact, equivalent to a form of the Chapman-Kolmogorov equation, 
B(t)B(t') = B(t + t'), for transition probability matrices B(t) of a stationary 
continuous-time Markov chain. In fact TKF91 is currently the only nontrivial 
transducer known to have this property (by "nontrivial" we mean including all 
types of state, and so excluding substitution-only models such as Figure [l4j 
which are essentially special limits of TKF91 where the indel rates are zero). 
An open question is whether there are any transducers for affine-gap versions of 
TKF91 which have this property (excluding TKF92 from this since it does not 
technically operate on strings, but rather sequences of strings (fragments) with 
immovable boundaries) . The Chapman-Kolmogorov equation for transducers is 
stated in Subsection l3.101 

2.3.3 Constraining the input to the substitution model 

Figure 25 is the composition of the MF-generator (Figure[9]) with the Q-substituter 



(Figure 14 ). This is quite similar to a probabilistic weight matrix trained on the 
single sequence MF, as each of the Insert states has its own emission probability 
weights. 

2.3.4 Constraining the input to the TKF91 model 

Figure 26 is the composition of the MF-generator (Figure [9| with TKF91 (Fig- 
ure 17). In contrast to Figure 25 this machine generates samples from the dis- 
tribution of descendants of a known ancestor (MF) via indels and substitutions. 
It is conceptually similar to a profile HMM trained on the single sequence MF 
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Both transducers are in the End state. No charac- 
ters are absorbed or emitted. 




WW 


Both transducers are in the Wait state. No charac- 
ters are absorbed or emitted, but both are poised to 

dUDCl U DCt[UClHjCQ 11 Ulll L11C11 111J.J LI L L dJJCD . 




SI 


The second transducer inserts a character y while 


TTy 




the first remains in the Start state. 


IM 


Trip mt"Q+/ T"t*£i nQri 11 ppt" Pmit/Q £i PrifiTfipf/PY" i^nti tnp Ttigptt' 

_I_ I1C HIoL LI dllSU. LIL-Cl C1I11LS d Clldl dCLCI 1 V Id UllL. AllOCI L 


(tt PTmT RtW 

y/l CAJJ^JLL j jy 




state) which the second transducer absorbs, mutates 
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The second transducer inserts a character y while 
the first remains in the Insert state from a previous 
insertion. Only the second transducer is changing 
state (and thus emitting a character) here — the first 
is resting in an Insert state from a previous transi- 
tion. 


y 


MI 


The second transducer inserts a character y while 
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the first remains in the Match state from a previous 




match. Only the second transducer is changing state 






(and thus emitting a character) here — the first is 






resting in a Match state from a previous transition. 





Table 1: Meanings (and, where applicable, I/O functions) of all states in 
transducer B ■ B (Figure [24| , the composition of two TKF91 transducers (an 
individual TKF91 transducer is shown in Figure |l7|) . Every state corresponds 
to a tuple (61,62) where b\ is the state of the first TKF91 transducer and 62 is 
the state of the second TKF91 transducer. 




91) and its subsequent mutation via 
14|) . Since no gaps are involved, and 



Figure 25: Transducer A(MF) ■ S(Q) corresponds to the generation of se- 
quence MF (by the transducer in Figure 
substitutions (by the transducer in Figure 

the initial sequence length is specified (by Figure [9]) , there is only one state 
path through this transducer. State types are indicated by the shape (as per 
Figure [7|), and the I/O functions of the two Insert states are indicated. For 
more information see Subsection 12.3.31 



Figure 26: Transducer A(MF) ■ B, the composition of the MF-generator (Fig- 
ure [9| with the TKF91 transducer (Figure 17). This can be viewed as an HMM 
that generates sequences sampled from the distribution of TKF91-mutated de- 
scendants of ancestor MF. See Subsection |2.3.4| and Table [2] for the meaning of 
each state. 
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Both transducers are in the Start state. No charac- 
ters are absorbed or emitted. 
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Both transducers are in the End state. No charac- 
ters are absorbed or emitted. 
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The generator emits the M character, which TKF91 
absorbs via its Match state and emits character y. 
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The generator emits the M character, which TKF91 
absorbs via its Delete state. 
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TKF91 inserts a character y while the generator re- 
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mains in the Insert state from which it emitted its 




last character (M). 
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The generator emits the F character, which TKF91 
absorbs via its Match state and emits character y. 
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The generator emits the F character, which TKF91 
absorbs via its Delete state. 
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TKF91 inserts a character y while the generator re- 


TTy 




mains in the Insert state from which it emitted its 




last character (F). 





Table 2: Meanings (and, where applicable, I/O functions) of all states in 
transducer A(MF) ■ B (Figure 26), the composition of the MF-generator (Fig- 
ure]^) with the TKF91 transducer (Figure 17). Since this is an ensemble of two 
transducers, every state corresponds to a pair (i, b) where i is the state of the 
generator transducer and b is the state of the TKF91 transducer. 



(though without the Dirichlet prior distributions that are typically used when 
training a profile HMM). 

The meaning of the various states in this machine are shown in Table [2] 
As noted in Subsection |2.3[ a generator composed with any transducer is still 
a generator. That is, the composite machine contains no states of type Match 
or Delete: it accepts null input (since its immediately-upstream transducer, the 
MF-generator, accepts null input). Even when the TKF91 is in a state that 
accepts input symbols (Match or Delete), the input symbol was inserted by 
the MF-generator; the MF-generator does not itself accept any input, so the 
entire ensemble accepts no input. Therefore the entire composite machine is a 
generator, since it accepts null input (and produces non-null output). 
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Figure 27: Transducer S(Q) ■ V(MF) 'recognizes' sequences ancestral to MF: 
it computes the probability that a given input sequence will mutate into MF, 
assuming a point substitution model. In contrast to the machine in Figure [25} 
this machine accepts an ancestral sequence as input and emits null output. Note 
than a sequence of any length besides 2 will have recognition weight zero. 




Figure 28: Transducer S(Q) ■ V(C5) 'recognizes' sequences ancestral to CS: 
it computes the probability that a given input sequence will mutate into CS, 
assuming a point substitution model. In contrast to the machine in Figure [25} 
this machine accepts an ancestral sequence as input and emits null output. Note 
than a sequence of any length besides 2 will have recognition weight zero. 



2.3.5 Constraining the output of the substitution model 



Figure 27 and Figure 28 show the composition of the substituter with (respec- 
tively) the MF-recognizer and the CS-recognizer. These machines are similar 
to the one in Figure |25[ but instead of the substituter taking its input from a 
generator, the order is reversed: we are now feeding the substituter's output to 
a recognizer. The composite machines accept sequence on input, and emit null 
output. 

2.3.6 Constraining the output of the TKF91 model 

Figure[29}shows the composition of a TKF91 transducer with the MF-recognizer. 
It is worth comparing the recognizer in Figure [29] with the analogous generator 



Figure 29: Transducer B ■ V(MF) computes the probability that a given 
ancestral input sequence will evolve (via the TKF91 model) into the specific 
descendant MF. It can therefore be thought of as a recognizer for the ancestor 
of MF. This machine absorbs sequence on its input and has null output due 
to the recognizer's null output. See Subsection 2.3.6 and Table [3] for more 
information. 
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Figure 30: Transducer A(MF) ■ S(Q) ■ V(C5) is a pure Markov chain (with no 
I/O) whose single Start— > End state path describes the transformation of MF to 
CS via substitutions only. 



in Figure 26 While similar, the generator and the recognizer are not the same; 



for a sequence S, Figure [29] with S as input computes P(MF\S) (probability 



that descendant of S is MF), whereas Figure 26 with S as output computes 
P(S\MF) (probability that descendant of MF is S). 

The meaning of the various states in this model are shown in Table [3] 

2.3.7 Specifying input and output to the substitution model 



Figure 30 shows the composition of the MF-generator, substituter, and CS- 
recognizer. This machine is a pure Markov chain (with no inputs or outputs) 
which can be used to compute the probability of transforming MF to CS. Specifi- 
cally, this probability is equal to the weight of the single path through Figure [30] 
Note that composing A(MF) ■ S(Q) ■ V(LIV) would result in a state graph 
with no valid paths from Start to End, since S(Q) cannot change sequence 
length. This is correct: the total path weight of zero, corresponding to the 
probability of transforming MF to LIV by point substitutions alone. 

2.3.8 Specifying input and output to the TKF91 model 



Figure 31 shows the composition of three transducers, the MF-generator, TKF91 
and LIV-recognizer (transitions are omitted for clarity) . This is one step beyond 
the machine in Figure [30j as it allows insertions and deletions to separate the 
generated (MF) and recognized (CS) sequences. 

The meaning of the various states in this model are shown in Table [4] 



Figure 31 like Figure 30 contains only null states (no match, insert, or delete 
states), making it a pure Markov chain with no I/O. Such Markov models can 
be viewed as a special case of an input/output machine where the input and 
output are both null. (As noted previously, a hidden Markov model corresponds 
to the special case of a transducer with null input, i.e. a generator.) 

Probability P{Y = LTV\X = MF) for the TKF91 model is equal to sum of 



all path weights from start to end in the Markov model of Figure 31 The set 
of paths corresponds to the set of valid evolutionary transformations relating 
sequences MF and LIV (valid meaning those allowed under the model, namely 
non-overlapping single-residue indels and substitutions). 

This is directly analogous to computing a pairwise alignment (e.g. using the 
Needlman-Wunsch algorithm), and the structure of the Markov model shown in 



Figure 31 suggests the familiar structure of a pairwise dynamic programming 
matrix. 
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bd 


Meaning 


I/O fn. 


SS 


Both transducers are in the Start state. No charac- 
ters are absorbed or emitted 




EE 


Both transducers are in the End state. No charac- 
ters are absorbed or emitted 




WW 


Both transducers are in Wait states. No characters 
are absorbed or emitted* both are noised to accent 
input. 




DW 


TKF91 absorbs a character a; and deletes it; recog- 
nizer remains in the initial Wait state (Wo). 


1 




TKF91 emits a character (M) via an Insert state and 
it is recognized by the 5m state of the recognizer. 






r FT^F91 absorbs a character t via a A/Tatch state 
then emits an M, which is recognized by the 5m 
state of the recognizer. 




WW M 


Both transducers are in Wait states: TKF91 in its 
only Wait state and the recognizer in the Wait state 
following the M character. 




DW m 


TKF91 absorbs a character x and deletes it; recog- 
nizer remains in the Wait state (Wm)- 


i 


15 F 


TKF91 emits a character (F) via an Insert state and 
it is recognized by the 5f state of the recognizer. 




MS F 


TKF91 absorbs a character x via a Match state, 
then emits an F, which is recognized by the 5f state 
of the recognizer. 


exp(Rt) xF 


WW F 


Both transducers are in Wait states: TKF91 in its 
only Wait state and the recognizer in the Wait state 
following the F character. 




DW f 


TKF91 absorbs a character x and deletes it; recog- 
nizer remains in the Wait state (Wf)- 


1 



Table 3: Meanings (and, where applicable, I/O functions) of all states in trans- 



ducer B ■ V(MF) (Figure |29[), the composition of the TKF91 model (Figure[l7l 
with the MF-recognizer (Figure 11). Since this is an ensemble of two transduc- 
ers, every state corresponds to a pair (b, d) where b is the state of the TKF91 
transducer and d is the state of the recognizer transducer. 
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Figure 31: Transducer A(MF) ■ B ■ V(LiY) is a Markov chain (that is to 
say, a state machine, but one with no input or output characters) wherein every 
Start— i-End state path describes an indel history via which sequence MF mutates 
to sequence LIV. Note that the structure of this Markov chain is very similar to a 
dynamic programming matrix for pairwise sequence alignment. The "rows" and 
"cells" of this matrix are shown as boxes. Computation of the total Start— >End 
path weight, using a recursion that visits all the states in topological sort order, 
is similar to the Forward algorithm of HMMs. See Subsection |2.3.8| and Table [4] 
for more information on the states of this model. 
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Meaning 

All transducers are in the Start state. No characters are emitted 
or absorbed. 

All transducers are in their Wait states which precede the End 
states for each transducer. No characters are emitted or ab- 
sorbed. 

All transducers are in their End states. No characters are emit- 
ted or absorbed. 

The component TKF91 machine emits a character y via the 
Insert state (/) which is read by the S y state of the recognizer; 
the generator remains in the Start state (S), not yet having 
emitted any characters. 

The component TKF91 machine emits a character y via the 
Insert state (I) which is read by the S y state of the recognizer; 
the generator remains in an Insert state (t x ) corresponding to 
the last character x which it generated. 

The generator emits character x via an Insert state (i x ); TKF91 
absorbs the x and emits a y via a Match state (M); the recog- 
nizer absorbs the y character via state 5 y . 

The generator emits character x via an Insert state {i x )i and 
TKF91 absorbs and deletes it via a Delete state (D). The rec- 
ognizer remains in W y , the Wait state following character y (or 
Wq if the recognizer has not yet read any characters). 



Table 4: Meanings of states in transducer A(MF) ■ B ■ V{LIV) (Figure [31]), 
the composition of the MF-generator (Figure [9]), TKF91 (Figure 17) and the 
LIV-recognizer (Figure 10). Since this is an ensemble of three transducers, 
every state corresponds to a triple (i,b,d) where i is the state of the generator 
transducer, b is the state of the TKF91 transducer and d is the state of the 
recognizer transducer. Where states are labeled with x or y suffices (e.g. SI6 y ), 
then the definition is valid Vie {M, F} and My e {L, I, V}. 
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2.4 Removal of null states 



In Figure [24j the state ID is of type Null: it corresponds to an insertion by the 
first TKF91 transducer that is then immediately deleted by the second TKF91 
transducer, so there is no net insertion or deletion. 

It is often the case that Null states are irrelevant to the final analysis. (For 
example, we may not care about all the potential, but unlikely, insertions that 
were immediately deleted and never observed.) It turns out that is often useful 
to eliminate these states; i.e. transform the transducer into an equivalent trans- 
ducer that does not contain null states. (Here "equivalent" means a transducer 
that defines the same weight for every pair of I/O sequences; this is made precise 



in Subsection 3.1 ) Fortunately we can often find such an equivalent transducer 
using a straightforward matrix inversion |29j . Null state removal is described in 
more detail in Subsection 13. 121 and Subsection 13. 19.31 



2.5 Intersection of transducers 

Our second operation for connecting two transducers involves feeding the same 
input tape into both transducers in parallel, and is called "intersection" . 

As with a transducer composition, an intersection is constructed by taking 
the Cartesian product of two transducers' state spaces. From two transducers 
T and U , we make a new transducer T o U wherein every state corresponds to 
a pair (t, u) of T- and [/-states, and whose output tape constitutes a pairwise 
alignment of the outputs of T and U. 

A property of intersection is that if T and U are both recognizers, then there 
is no output, and so T o U is a recognizer also. 

Conversely, if either T or U is not a recognizer, then T o U will have an 
output; and if neither T or U is a recognizer, then the output of To U will be a 
pairwise alignment of T's output with C/'s output (equivalently, we may regard 



T o U as having two output tapes). Transducer Q n in Subsection 3.11.6 is an 
example of such a two-output transducer. Indeed, if we do further intersections 
(such as (ToU)oV where V is another transducer) then the resultant transducer 
may have several output tapes (as in the general multi-sequence HMMs defined 



]). The models denoted F n in Subsection 3.11.4 are of this nature. 



A formal definition of transducer intersection, together with an algorithm 



for constructing the intersected transducer Toll, is presented in Subsection 3.5 
Like the construction of TU, this algorithmic construction ofToU serves both 
as a proof of the existence of TU, and an upper bound on its complexity It is 
also essential as a formal means of verifying our later results. 



2.5.1 Composition, intersection and Felsenstein's pruning algorithm 

If a composition is like matrix multiplication (i.e. the operation of evolution 
along a contiguous branch of the phylogenetic tree) , then an intersection is like 
a bifurcation at a node in the phylogenetic tree, where a branch splits into two 
child branches. 
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Intersection corresponds to the pointwise multiplication step in the Felsen- 
stein pruning algorithm i.e. the calculation 

P(dcsccndants|parcnt) = P(lcft child and its descendants|parent)P(right child and its descendants! parent) 

Specifically, in the Felsenstein algorithm, we define G^ (x) to be the proba- 
bility of all observed descendants of node n, conditional on node n having been 
in state x. Let us further suppose that is the conditional substitution 

matrix for the branch above node n (coming from n's parent), so 

= P(node n is in state j|parent of node n is in state i) 

Then we can write the core recursion of Felscnstcin's pruning algorithm in ma- 
trix form; is a column vector, M^> is a matrix, and the core recursion 



is 



where (I, r) are the left- and right-children of node n, "•" denotes matrix mul- 
tiplication, and "o" denotes the pointwise product (also called the Hadamard 
product), defined as follows for two vectors A and B: 



(A o B)i = AiBi 



AoB = 



( A X B X 
A 2 B 2 
A 3 B 3 



\ 



\ A K B K J 



Thus the two core steps of Felsenstein's algorithm (in matrix notation) are 
(a) matrix multiplication and (b) the pointwise product. Composition provides 
the transducer equivalent of matrix multiplication; intersection provides the 
transducer equivalent of the pointwise product. 

Note also that G^ n \x) is the probability of node n's observed descendants 
conditional on x, the state of node n. Thus G^ n > is similar to a recognition 
profile, where the computed weight for a sequence S represents the probability 
of some event (recognition) conditional on having S as input, i.e. a probability of 
the form P(. . . \S) (as opposed to a probability distribution of the form P(S\ . . .) 
where the sequence S is the output, as is computed by generative profiles). 

Finally consider the initialization step of the Felsenstein algorithm. Let n be 
a leaf node and y the observed character at that node. The initialization step is 

G (n) (x) =S(x = y) 

i.e. we initialize G^ with a unit column vector, when n is a leaf. This unit 
vector is, in some sense, equivalent to our exact-match recognizer. In fact, gen- 
erators and recognizers are analogous to (respectively) row-vectors and column- 
vectors in our infinite-dimensional vector space (where every dimension rep- 
resents a different sequence). The exact-match recognizer V(5) resembles a 
unit column-vector in the 5*-dimension, while the exact-match generator A(5) 
resembles a unit row-vector in the S'-dimension. 
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a 



Figure 32: The transducer (S(Q) ■ V(CS)) o (S(Q) ■ V(MF)) recognizes the 
common ancestor of MF and CS. 



2.5.2 Recognizer for a common ancestor under the substitution model 

Figure [32] shows the intersection of Figure |27| and Figure [28} This is an ensemble 
of four transducers. Conceptually, what happens when a sequence is input is 
as follows. First, the input sequence is duplicated; one copy is then fed into a 



substituter (Figure 14 1, whose output is fed into the exact-matcher for CS (Fig- 
ure Il2|) ; the other copy of the input sequence is fed into a separate substituter 



(Figure 22), whose output is fed into exact-matcher for MF (Figure 11). 

The composite transducer is a recognizer (the only I/O states are Deletes; 
there are no Matches or Inserts), since it absorbs input but the recognizers 
(exact-matchers) have null output. Note also that the I/O weight functions in 
the two Delete states in Figure 32 are equivalent to the G^ probability vectors 



in Felsenstein's algorithm (Subsection 2.5.1) 



Since this is an ensemble of four transducers, every state corresponds to a 
tuple (61,^1,62,(^2) where 61 is the state of the substituter transducer on the 



CS-branch (Figure 14), d\ is the state of the exact-matcher for sequence CS 
(Figure 12), 6 2 is the state of the substituter transducer on the MF-branch 



(Figure 22), d\ is the state of the exact-matcher for sequence MF (Figure 11). 

The I/O label for the general case where (61,(^1,62,^2) = {Q, 5a, R, <5b) 
denotes the vector whose elements are given by Q x aRxB- So, for example, the 
I/O label for state (Q, Sc, R, &m) is the vector whose x'th entry is the probability 
that an input symbol x would mutate to C on the Q-branch and M on the R- 
branch. 



2.5.3 



The Felsenstein likelihood for a two-branch tree using the sub- 
stitution model 



As noted, the previous machine (Figure 32 ) can be considered to compute the 
Felsenstein probability vector G^ n ' at an internal node n of a tree. When n is 
the root node of the tree, this must be multiplied by a prior over sequences if 
we are to compute the final Felsenstein probability, 



P(sequences|tree) = itG^ = ^xG 1 ^ 



In transducer terms, this "multiplication by a prior for the root" is a com- 
position: we connect a root generator to the input of the machine. Composing 
Figure 13 (a simple prior over root sequences: geometric length distribution 



and IID with frequencies n) with the transducer of Figure 32 (which represents 
G«) yields a pure Markov chain whose total path weight from Start to End is 
the final Felsenstein probability (Figure 33). Furthermore, sampling traceback 
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Figure 33: Transducer TV- ((5(Q) • V(C5))o (5(Q) • V(MF))) allows computing 
the final Felsenstein probability for the two sequence MF and CS descended from 
an unobserved ancestor by character substitutions. Since characters emitted 
by the JV transducer are all absorbed by the two recognizers, the composite 
machine has null input and output. The probabilities (involving the geometric 
parameter p, the prior tt and the substitution matrix Q) are all encoded on the 
transitions. The use of summation operators in the weights of these transitions 
reflects the fact that multiple transitions have been collapsed into when by our 
composition algorithm, which marginalizes I/O characters that are not directly 
observed. Note that this machine is equivalent to the composition of a null 



model (Figure 13 1 with the machine in Figure 32 



paths through this machine provides an easy way to sample from the posterior 
distribution over ancestral sequences relating MF and CS. 



2.5.4 Recognizer for a common ancestor under the TKF91 model 

The transducer shown in Figure [34j is the recognition profile for the TKF91- 
derived common ancestor of LIV and MF. LIV and MF may each differ from the 
ancestor by insertions, deletions, and substitutions; a particular path through 
this machine represents one such explanation of differences (or, equivalently, an 
alignment of LIV, MF, and their ancestor). This type of machine is denoted G n 
in Subsection 13 . 1 1 . 5l 

Figure |34]is an ensemble of four transducers. Conceptually, what happens to 
an input sequence is as follows. First, the input sequence is duplicated; one copy 



of the input is fed into TKF91 (Figure 17 1, whose output is fed into the exact- 
matcher for LIV (Figure 10); the other copy of the input is fed into a separate 
TKF91 machine (Figure 17), whose output is fed into the exact-matcher for MF 
(Figure 11 ). 

Since this is an ensemble of four transducers, every state corresponds to 
a tuple (61,^1,621^2) where 61 is the state of the TKF91 transducer on the 



LIV-branch (Figure 17), d\ is the state of the exact-matcher for sequence LIV 



(Figure 10), 62 is the state of the TKF91 transducer on the MF-branch (Fig- 
ure 17), and d\ is the state of the exact-matcher for sequence MF (Figure 11 ). 



Note that, as with Figure [311 the underlying structure of this state graph is 
somewhat like a DP matrix, with rows, columns and cells. In fact, (modulo some 
quirks of the automatic graph layout performed by graphviz's 'dot' program) 
and Figure [31] are structurally quite similar. However, compared to 

: 'cell", because this transducer 



Figure 34 has more states in each 



Figure 
Figure 

tracks events on two separate branches (whereas Figure 31 only tracks one 
branch) . 

Since this machine is a recognizer, it has Delete states but no Match or Insert 
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Figure 34: Transducer (B-V(LIV))o(B-V(MF)) recognizes the ancestor of LIV 
and MF, assuming common descent under the TKF91 model. As with Figure [3l| 
the structure of this machine is very similar to a dynamic programming matrix 
for pairwise sequence alignment; again, the "rows" and "cells" of this matrix 
have been shown as boxes. In contrast to Figure |31[ this machine is not a 
Markov model (ancestral sequence already encoded in the state space), but a 
recognizer (ancestral sequence read from input tape). This Figure, along with 
several others in this manuscript, was autogenerated using phylocomposer [22j 
and graphviz [3D] . Formal definitions: This type of machine is denoted G n in 
Subsection |3.11.5| The algorithms for constructing composite and intersected 



transducer state graphs, are reviewed in Subsection 3.4 and Subsection 3.5 
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states. The delete states present do not necessarily correspond to deletions by 
the TKF91 transducers: all characters are ultimately deleted by the exact-match 
recognizers for LIV and MF, so even if the two TKF91 transducers allow symbols 
to pass through undeleted, they will still be deleted by the exact-matchers. 

In fact, this transducer's states distinguish between deletion events on one 
branch vs insertion events on the other: this is significant because a "deleted" 
residue is homologous to a residue in the ancestral sequence, while an "inserted" 
residue is not. There are, in fact, four delete states in each "cell" of the matrix, 
corresponding to four fates of the input symbol after it is duplicated: 

1. Both copies of the input symbol pass successfully through respective TKF91 
transducers and are then deleted by respective downstream exact-matchers 
for sequences LIV and MF (e.g. MD L MD F ); 

2. One copy of the input symbol is deleted by the TKF91 transducer on the 
LIV-branch, leaving the downstream LIV-matcher idling in a Wait state; 
the other copy of the input symbol passes through the TKF91 transducer 
on the MF-branch and is then deleted by the downstream MF-matcher; 
(e.g. DW Q MD F ); 

3. One copy of the input symbol passes through the TKF91 transducer on 
the LIV-branch and is then deleted by the downstream LIV-matcher; the 
other copy is deleted by TKF91 transducer on MF-branch, leaving the 
downstream MF-matcher idling in a Wait state; (e.g. MD^DWo); 

4. Both copies of the input symbol are deleted by the respective TKF91 trans- 
ducers, while the downstream exact-matchers idle in Wait states without 
seeing any input (e.g. DWqDWq). 

The other states in each cell of the matrix are Null states (where the sym- 
bols recognized by the LIV- and MF-matchers originate from insertions by the 
TKF91 transducers, rather than as input symbols) and Wait states (where the 
ensemble waits for the next input symbol). 



2.5.5 The Felsenstein likelihood for a two-branch tree using the 
TKF91 model 

If we want to compute the joint marginal probability of LIV and MF as siblings 
under the TKF91 model, marginalizing the unobserved common ancestor, we 
have to use the same trick that we used to convert the recognizer in Figure [32] 
into the Markov model of Figure 33 That is, we must connect a generator 
to the input of Figure 34 where the generator emits sequences from the root 
prior (equilibrium) distribution of the TKF91 model. Using the basic generator 
shown in Figure |18| we construct the machine shown in Figure 35 (This type 



of machine is denoted M n in Subsection 3.11.6 ) 



Summing over all Start — » End paths in Figure [35] the total weight is the 
final Felsenstein probability, i.e. the joint likelihood P(LIV, AfF|tree, TKF91). 
Sampling traceback paths through Figure [35] yields samples from the posterior 



41 



Figure 35: Transducer K-((B- V(LIV)) o (B ■ V(MF))) models the generation 
of an ancestral sequence which is then duplicated; the two copies are mutated 
(in parallel) by two TKF91 transducers into (respectively) LIV and MF. This 
machine is equivalent to the generator in Figure [18] coupled to the input of the 
recognizer in Figure [34) Because it is a generator coupled to a recognizer, there 
is no net input or output, and so we can think of this as a straightforward 
Markov model, albeit with some probability "missing" : the total sum-over- 
paths weight from Start— > End is less than one. Indeed, the total sum-over-paths 
weight through this Markov model corresponds to the joint likelihood of the two 
sibling sequences, LIV and MF. Formal definitions: This type of machine is 
denoted M n in Subsection |3.11.6[ 
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Figure 36: The highest-weight path through the machine of Figure 35 



responds to the most likely evolutionary history relating the two sequences. 
Equivalently, this path corresponds to an alignment of the two sequences and 
their ancestors. Formal definitions: The subset of the state graph corre- 



sponding to this path is denoted M' n in Subsection 3.11.6 



distribution over common ancestors of LIV and MF. The sum-over-paths is 
computed via a form of the standard Forward dynamic programming algorithm, 
described in Subsection l3.161 

This ability to sample paths will allow us to constrain the size of the state 
space when we move from pairs of sequences to entire phylogenetic trees. Tracing 
paths back through the state graph according to their posterior probability is 
straightforward once the Forward matrix is filled; the algorithmic internals are 
detailed in Subsection I3.17.2l 



2.5.6 Maximum likelihood ancestral reconstruction under the TKF91 
model 



In Figure 36 the highest-weight (Viterbi) traceback path is highlighted. This 
path, via the significances of each of the visited states, corresponds to an ances- 
tral alignment relating LIV and MF: an alignment of the sequences and their 
ancestor. 

Ancestral sequence * * * 
e.g. Sequence 1 L V I 

Sequence 2 M F - 

Computing this alignment/path is straightforward, and essentially amounts 
to choosing the highest-probability possibility (as opposed to sampling) in each 



step of the traceback detailed in section Subsection 3.17.2 



If we remove the root generator, we obtain a linear recognizer profile for the 
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Figure 37: Removing the generator from Figure |36| leaves a recognition profile 
for the ancestral sequence relating MF and LIV. Formal definitions: This and 



related transformations to sampled state paths are described in Subsection 3.19 
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Figure 38: Transducer V\ is a linear recognition profile for the ancestor relating 
MF and LIV, created by taking the states visited by the Viterbi path shown in 
Figure [37) Formal definitions: This machine, and the one in Figure 141] are 



examples of the recognition profiles E n described in Subsection 3.11.6 
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Figure 39: Additionally to finding just the highest-weight path through the 



machine of Figure 35 (as in Figure 36 1, it is possible to sample suboptimal paths 

Here, two sampled paths through 
The subset of the state graph 

The 



proportional to their posterior probability 
the state graph are shown. Formal definitions 

covered by the set of sampled paths is denoted M' in Subsection 



3.11.6 



mathematical detail of sampling paths is described in Subsection |3.17.2 



sequence at the ancestral node (Figure 38 1 , as might be computed by progressive 



alignment. This can be thought of as a profile HMM trained on an alignment 
of the two sequences MF and LIV. It is a machine of the same form as E n in 
Subsection 13.11.61 



2.5.7 Sampling ancestral reconstructions under the TKF91 model 

Instead of only saving the single highest-weight path through the machine of 



Figure 35 (as in Figure 36), we can sample several paths from the posterior 
probability distribution over paths. In Figure [39} two sampled paths through the 
state graph are shown. Tracing paths back through the state graph according to 
their posterior probability is straightforward once the Forward matrix is filled; 



the algorithmic internals are detailed in Subsection 3.17.2 



It is possible that many paths through the state graph have weights only 
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Figure 40: Two sample paths through the machine in Figure [35| representing 
possible evolutionary histories relating MF and LIV. These are the same two 
paths as in Figure 39 but we have removed the root generator (as we did to 
transform Figure 36 into Figure 37 1 . Paths are sampled according to their poste- 
rior probability, allowing us to select a high-probability subset of the state graph. 
The mathematical details of sampling paths is described in Subsection |3 . 1 7. 2| 



slightly less than the Viterbi path. By sampling suboptimal paths according 
to their weights, it is possible to retain and propogate this uncertainty when 
applying this method to progressive alignment. Intuitively, this corresponds 
to storing multiple evolutionary histories of a subtree as progressive alignment 
climbs the phylogenetic tree. 
For instance, in addition to sampling this path 
Ancestral sequence * 
Sequence 1 L 
Sequence 2 M 

we also have this path 

Ancestral sequence * 
Sequence 1 L 
Sequence 2 M 
Combining these paths into a single graph and relabeling again, we still 
have a recognition profile, but it is now branched, reflecting the possible un- 
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Figure 41: Transducer V2 is a branched recognizer for the ancestor common 
to MF and LIV. The branched structure results from sampling multiple paths 
through the state graph in Figure [35] mapping the paths back to Figure [34] 
and retaining only the subset of the graph visited by a sampled path. Formal 
definitions: This machine, and the one in Figure [38] are examples of the 
recognition profiles E n described in Subsection|3.11.6 



tions required to convert Figure 39 into Figure 



certainty in our alignment/ancestral prediction, shown in Figure 41 (which is a 
transducer of the form E„ in Subsection [3TL6|. The exact series of transforma- 

(removing the root generator, 
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eliminating null states, and adding wait states to restore the machine to Moore 



normal form) is detailed in Subsection 3.19 



Note that these are all still approximations to the full recognition profile 
for the ancestor, which is Figure [34] While we could retain this entire profile, 
progressively climbing up a tree would add so many states to the graph that 
inference would quickly become an intractable problem. Storing a subset allows 
a flexible way in which to retain many high-probability solutions while still 
allowing for a strict bound on the size of the state space. 



2.5.8 Ancestral sequence recognizer on a larger TKF91 tree 



Figure 
Figure 



42] shows the full recognition profile for the root-ancestral sequence in 
1] representing all the possible evolutionary histories relating the three 



sequences. For clarity, many transitions in this diagram have been removed 
or collapsed. (The recognition transducer for the root profile is denoted G\ in 
Subsection 3.11.5 ) 



2.5.9 Felsenstein probability on a larger TKF91 tree 

We have now seen the individual steps of the transducer version of the Felsen- 
stein recursion. Essentially, composition replaces matrix multiplication, inter- 
section replaces the pointwise product, and the initiation/termination steps in- 
volve (respectively) recognizers and generators. 

The full recursion (with the same 0(L N ) complexity as the Sankoff algo- 
rithm |13j for simultaneously aligning N sequences of length L, ignoring sec- 
ondary structure) involves starting with exact-match recognition profiles at the 
leaves (Figure 11 Figure 10), using those to construct recognition profiles for 



the parents (Figure 34 1, and progressively climbing the tree toward the root, 
constructing ancestral recognition profiles for each internal node. At the root, 
compose the root generator with the root recognition profile, and the Forward 
probability can be computed. 
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Figure 42: Transducer (B ■ (B ■ V{LIV)) o (B ■ V(MF))) o (B ■ V(CS)) is the 
full recognition profile for the root-ancestral sequence in Figure [TJ representing 
all the possible evolutionary histories relating the three sequences. For clarity, 
many transitions in this diagram have been removed or collapsed. Formal 
definitions: This transducer, which recognizes the root sequence in Figure [l] 
is denoted G\ in Subsection |3 . 1 1 . 5) 
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Figure 43: Transducer (B ■ V\) o (B ■ V(C5)) recognizes the common ancestor 
of CS and V\ . Transducer V\ , shown in Figure [38] itself models the common 
ancestor of MF and LIV. Using profile V\, which is essentially a best-guess 
reconstructed ancestral profile, represents the most resource-conservative form 
of progressive alignment: only the maximum-likelihood indel reconstruction is 
kept during each step of the Felsenstein pruning recursion. Formal definitions: 



This type of transducer is denoted H n in Subsection 3.11.6 



2.5.10 Progressive alignment version of Felsenstein recursion 

The "progressive alignment" version, equivalent to doing Felsenstein's pruning 
recursion on a single alignment found using the progressive alignment algorithm, 
involves sampling the single best linear recognition profile of the parent at each 
internal node, as in Figure [37] 

We then repeat the process at the next level up in the tree, aligning the 

is 
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34 



is 



parent profile to its sibling, as shown in Figure 43 The machine in Figure 
an example of the transducer H n in Subsection 3.11.6 (Technically, Figure 
also an example of H n , as well as being an example of G n . The difference is that 
G n describes all possible histories below node n, since it is made by combining 
the two transducers Gi and G r for the two children (I, r) of node n. By contrast, 
H n only describes a subset of such histories, since it is made by combining the 
two transducers Ei and E r , which are subsets of the corresponding Hi and H r ; 



just as Figure 38 and Figure 41 are subsets of Figure 34 ) 
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Figure 44: Transducer (B ■ V2) o (B ■ V(C5)) shows the alignment of the 
sequence CS with the sampled profile Vi- Transducer V2, shown in Figure 41 is 



a branched profile whose different paths represent alternate ancestries of sibling 
sequences MF and LIV. Formal definitions: The type of transducer shown in 
this Figure is denoted H n in Subsection |3.1 1.6) 



This method can be recognized as a form of "sequence-profile" alignment as 
familiar from progressive alignment, except that we don't really make a distinc- 
tion between a sequence and a profile (in that observed sequences are converted 
into exact-match recognition profiles in the very first steps of the procedure). 

The "progressive" algorithm proceeds in the exact same way as the "full" 
version, except that at each internal node a linear profile is created from the 
Viterbi path through the state graph. This "best guess" of the alignment of 
each subtree is likely to work well in cases where the alignment is unambiguous, 
but under certain evolutionary parameters the alignment of a subtree may not 
be clear until more sequences are observed. 



2.6 Stochastic lower bound version of Felsenstein recur- 
sion 

Our stochastic lower bound version is intermediate to the progressive align- 
ment (Viterbi-like) algorithm and the full Sankoff algorithm. Rather than just 
sampling the best linear profile for each parent, as in progressive alignment (Fig- 



ure l37j) , we sample some fixed number of such paths (Figure 40 ) . This allows 



us to account for some amount of alignment uncertainty, while avoiding the full 



complexity of the complete ancestral profile (Figure 42). 

By sampling a fixed number of traceback paths, we can construct a recogni- 
tion profile for the ancestral sequence that is linearly bounded in size and offers 
a stochastic "lower bound" on the probability computed by the full Felsenstein 
transducer in Figure [35| 



Figure 44 shows the intersection of V2 (the ancestor of MF and LIV) with 
sequence CS, with TKF91 models accounting for the differences. Again this 
is a "sequence-profile" alignment, though it can also be called "profile-profile" 
alignment, since CS can be considered to be a (trivial) linear profile. Unlike 
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in traditional progressive alignment (but quite like e.g. partial order alignment 
[T5]). one of the profiles is now branched (because we sampled more than one 
path to construct it in Figure 40), allowing a tunable (by modulating how 
many paths are sampled in the traceback step) way to account for alignment 
uncertainty. 

Just like Figu re |43[ th e machine in Figure [44] is an example of the transducer 
H n in Subsection|3.11.6| 



Finally we compose the root generator (Figure 18) with Figure 44 (If there 



were more than two internal nodes in this tree, we would simply continue the 
process of aligning siblings and sampling a recognition profile for the parent, 
iteratng until the root node was reached.) The sum of all path weights through 
the state graph of the final machine represents the stochastic lower-bound on the 
final Fclscnstein probability for this tree. Since we have omitted some states at 
each progressive step, the computed probability does not sum over all possible 
histories relating the sequences, hence it is a lower bound on the true Felsen- 
stein probability. (Each time we create a branched profile from a subset of the 
complete state graph, we discard some low-probability states and therefore leak 
a little bit of probability, but hopefully not much.) The hope is that, in prac- 
tice, by sampling sufficiently many paths we are able to recover the maximum 
likelihood reconstruction at the root level, and so the lower bound is a good 
one. 



2.7 A Pair HMM for aligning siblings 



While we have constructed our hierarchy of phylogenetic transducers by us- 
ing recognizers, it is also sometimes useful (and perhaps more conventional in 
bioinformatics) to think in terms of generators. For example, we can describe 
the multi-sequence HMM that simultaneously emits an alignment of all of the 
sequences; this is a generator, and in fact is the model F n described in Sub- 
section 3.11.4 (An animation of such a multi-sequence generator emitting a 
small alignment can be viewed at http://youtu.be/EcLj5MSDPyM with more 
at http : //biowiki . org/PhyloFilm ) 

If we take the intersection of two TKF91 transducers, we obtain a trans- 
ducer that has one input sequence and two output sequences (or, equivalently, 
one output tape that encodes a pairwise alignment of two sequences). This 



transducer is shown in Figure 45 If we then connect a TKF91 equilibrium gen- 



erator (Figure 18 ) to the input of Figure 45 we get Figure 46 a generator with 



two output tapes, i.e. a Pair HMM. (To avoid introducing yet more tables and 
cross-references, we have confined the descriptions of the states in Figure [46] and 



Figure 45 to the respective Figure captions. 



The Pair HMM in Figure|46]is particularly useful, as it crops up in our algo- 
rithm whenever we have to align two recognizers. Specifically, Figure[35] — which 
looks a bit like a pairwise dynamic programming matrix for aligning sequence 
LIV to sequence MF — is essentially Figure [46] with one output tape connected 



to the LIV-recognizer (Figure 10) and the other output tape connected to the 



MF-recognizer (Figure 11). In computing the state space of machines like Fig- 
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ure [35] (which are called M n in Subsection 3.11.6), it is useful to precompute 
the state space of the component machine in Figure 46 (called Q n in Subsec- 
tion 3.11.6). This amounts to a run-time optimization, though it also helps us 



verify correctness of the model. 



2.8 A note on the relationship between this tutorial and 
the formal definitions section 

As noted throughout the tutorial, the example phylogeny and transducers can 
be directly related to the "Hierarchy of phylogenetic transducers" described in 
Subsection 13.111 onwards. 

Consider the tree of Figure [I] Let the nodes of this tree be numbered as 
follows: (1) Ancestral sequence, (2) Intermediate sequence, (3) Sequence LIV, 
(4) Sequence MF, (5) Sequence CS. 

Some of the transducers defined for this tree in Subsection 13.111 include 





R 


= K 


Vn e {2,3,4,5} : 


B„ 


= B 


E5 = H 5 


= G 5 


= V(CS) 


E4 = H4 


= G4 


= V(MF) 


E 3 = H 3 


= G 3 


= V(LIV) 




G 2 


= (B 3 ■ G 3 ) (Bi ■ d) 

= (B-V(LIV))o(B-V{MF)) 




H 2 


= (B 3 ■ E 3 ) (B A ■ Ei) 

= (B ■ V{LIV)) (B ■ V(MF)) 




M 2 


= RH 2 

= TZ-(B- V(LIV)) (B ■ V(MF)) 




E 2 


C H 2 




Gi 


= [B 2 ■ G 2 ) (B 5 ■ G 5 ) 

= (B-{B-V(LIV))o(B-V(MF)))o(B 




Hx 


- (B2 ■ E 2 ) (B 5 ■ E 5 ) 




Go 


= R • G\ 

= n-(B-(B-V(LIV))o(B-V(MF)))o 


Vne {1,2} : 


Qn 


= Ko(B-B) 



(B ■ V(C5)) 



Figure 
Figure 
Figure 
Figure 
Figure 



Figure [34] 
Figure [34] again 



Figure 
Figure 

Figure 
Figure 



Figure [41] 
Figure [44] 
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Figure 45: Transducer To (BoB) is a bifurcation of two TKF91 transducers. It 
can be viewed as a transducer with one input tape and two output tapes. Each 
state has the form w6i&2 where v is the state of the bifurcation transducer (Sub- 
section 3.6 1, 61 is the state of the first TKF91 machine (Figure 17) and 62 is the 



state of the second TKF91 machine. The meaning of I/O states (Match, Insert, 
Delete) is subtle in this model, because there are two output tapes. Dealing 
first with the Inserts: in states SIW and MIW, the first TKF91 transducer is 
inserting symbols to the first output tape, while in states SSI, MMI and MDI, 
the second TKF91 transducer is emitting symbols to the second output tape. 
Dealing now with the Matches and Deletes: the four states that can receive an 
input symbol are MMM, MMD, MDM and MDD. Of these, MMM emits 
a symbol to both output tapes (and so is a Match); MMD only emits a symbol 
to the first output tape (and so qualifies as a Match because it has input and 
output); MDM only emits a symbol to the second output tape (and so qualifies 
as a Match); and MDD produces no output at all (and is therefore the only 
true Delete state). 
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Figure 46: Transducer 1Z o (B o B) represents the operation of sampling a se- 
quence from the TKF91 equilibrium distribution and then feeding that sequence 
independently into two TKF91 transducers. Equivalently, it is the composition 
of the TKF91 equilibrium generator (Figure 18 ) with a bifurcation of two TKF91 
transducers (Figure 45). It can be viewed as a generator with two output tapes; 
i.e. a Pair HMM. Each state has the form ,06162 where p is the state of the 
generator, 61 is the state of the first TKF91 machine and 62 is the state of the 
second. As in Figure 45 the meaning of I/O states is subtle in this model, 
because there are two output tapes. We first deal with Insert states where one 
of the TKF91 transducers is responsible for the insertion. In states SIW and 
IIW, the first TKF91 transducer is emitting symbols to the first output tape; 
in states SSI, IDI and IMI, the second TKF91 transducer is emitting symbols 
to the second output tape. The remaining states (excluding SSS and EEE) all 
involve symbol emissions by the generator, that are then processed by the two 
TKF91 models in various ways. These four states that involve emissions by the 
generator are IMM, IMD, I DM and IDD. Of these, IMM emits a symbol to 
both output tapes (and so qualifies as an Insert state); IMD only emits a sym- 
bol to the first output tape (and is an Insert state); I DM only emits a symbol 
to the second output tape (and is an Insert state) ; and IDD produces no output 
at all (and is therefore a Null state) . Note that Figure 35 which looks a bit like 
a pairwise dynamic programming matrix, is essentially this Pair HMM with one 
output tape connected to the LIV-recognizer (Figure 10 1 and the other output 
tape connected to the MF-recognizer (Figure 111, Formal definitions: This 
type of transducer is called Q n in Subjection 3.11.6| When both its outputs 
are connected to recognizers (Hi and H r ), then one obtains a transducer of the 
form M n . 



3 Formal definitions 



This report makes our transducer-related definitions precise, including notation 
for state types, weights (i.e. probabilities), transducer composition, etc. 

Notation relating to mundane manipulations of sequences (sequence length, 
sequence concatenation, etc.) is deferred to the end of the document, so as not 
to interrupt the flow. 

We first review the letter transducer T, transduction weight W(x : [T] : y) 
and equivalence T = T' . 

We then define two operations for combining transducers: composition (TU) 
which unifies T"s output with ZJ's input, and intersection (T o U) which unifies 
T's and t/'s input. 

We define our "normal" form for letter transducers, partitioning the states 
and transitions into types {S, M, D, I, N, W, E} based on their input /output 
labeling. (These types stand for Start, Match, Delete, Insert, Null, Wait, End.) 
This normal form is common in the bioinformatics literature [31] and forms the 
basis for our previous constructions of phylogenetic transducers [3TJ [25] . 

We define exact-match and identity transducers, and give constructions of 
these. 

We define our hierarchy of phylogenetic transducers, and give constructions 
and inference algorithms, including the concept of "alignment envelopes" for 
size-limiting of transducers. 

3.1 Input-output automata 

The letter transducer is a tuple T = (Jl/, f2o> <^Sj 4>e, r j W) where 0/ is an 
input alphabet, ilo is an output alphabet, <I> is a set of states, <j)s € $ is the 
start state, 4> E S $ is the end state, r C $ x (f2/ U {e}) x (£1 U {e}) x $ is the 
transition relation, and W : r — > [0, oo) is the transition weight function. 

Transition paths: The transitions in r correspond to the edges of a labeled 
multidigraph over states in $. Let II C t* be the set of all labeled transition 
paths from <f>g to 4>e- 

I/O sequences: Let Si : II — > SI} and So '■ II be functions returning 

the input and output sequences of a transition path, obtained by concatenating 
the respective transition labels. 

Transduction weight: For a transition path tt € II, define the path weight 
W(tt) and (for sequences x £ Slj, y £ Qq) the transduction weight W(x : [T] : y) 

VV(tt) = J]_ W(t) 
W(x:[T]:y) = ]T W(tt) 

7ren, Si (7r)=x, So {t)=V 

Equivalence: If for transducers T,T' it is true that W{x : [T] : y) — W'(x : 
[T'] : y) yx,y then the transducers are equivalent in weight, T = T' . 
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3.2 State types and normal forms 



Types of state and transition: If there exists a state type function, type : $ — > T, 
mapping states to types in T = {S,M,D,I,N,W,E}, and functions W trans : 
$ 2 ->■ [0, oo) and W emit : (fij U {e}) x (fi G U {e}) x $ [0, oo), such that 
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W(0 src , wjn, "out^dest) = W (</>src, fetl^K, "out^dest) 

then the transducer is in (weak) normal form. If, additionally, $at = 0, then 
the transducer is in strict normal form. The above transition and emission 
constraints are summarized graphically in Figure [8j 

Interpretation: A normal-form transducer can be thought of as associating 
inputs and outputs with states, rather than transitions. (Thus, it is like a 
Moore machine.) The state types are start (S) and end (E); wait (W), in 
which the transducer waits for input; match (M) and delete (D), which process 
input symbols; insert (/), which writes additional output symbols; and null (A), 
which has no associated input or output. All transitions also fall into one of 
these types, via the destination states; thus, tm is the set of transitions ending 
in a match state, etc. The transition weight (W) factors into a term that is 
independent of the input /output label (W trans ) and a term that is independent 
of the source state (W emit ). 

Universality: For any weak-normal form transducer T there exists an equiva- 
lent in strict-normal form which can be found by applying the state-marginalization 
algorithm to eliminate null states. For any transducer, there is an equivalent 
letter transducer in weak normal form, and therefore, in strict normal form. 

3.3 Moore and Mealy machines 

The following terms in common usage relate approximately to our definitions: 
Mealy machines are transducers with I/O occurring on transitions, as with 

our general definition of the letter transducer. 

Moore machines are transducers whose I/O is associated with states, as with 

our normal form. The difference between these two views is illustrated via a 
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small example in Figure [6] and Figure [5j 



3.4 Composition (TU) unifies output of T with input of U 

Transducer composition: Given letter transducers T — (fix-, Oy, fys, 4>e-, t, W) 
and U = (0^,Ox,$ / ,^g,^,r',VV'), there exists a letter transducer TU — 
(Sl x ,n z ,$" ■■•W") such that Vie G fl* x ,z£ Q* z : 

W"(x : [TU] :z)= ^ W ( x : t T ] : ^) W '(^ : i U ] : z ) 

Example construction: Assume without loss of generality that T and U are 
in strict normal form. Then $" C $ x 0'g = (<jf> s , </>' s ), 0^ = (</>£, tj>' B ) and 

W , ((t,«),w x ,a;„(t / ,u / )) = 

' 5(t = tf)5(u x = e)W'(u,e,u z ,u') if typc(u) ^ VF 

W(t, cj x , e, t)5(u z = e)5(u = u') if type(u) = W, type(t') £ {AT, 1} 

W(t,w x ,LJ y ,t')W(u,u y ,u z ,u') iftype(u) = W,type(t')6{M,Z} 

otherwise 

The resulting transducer is in weak-normal form (it can be converted to a strict- 
normal form transducer by eliminating null states). 

In the tutorial section, many examples of simple and complex compositions 
are shown in Subsection |2.3[ for instance Figure [24} Figure [26] and Figure [3lj 



3.5 Intersection (To U) unifies input of T with input of U 

Transducer intersection: Given letter transducers T = (Ox, firi ^, ( t ) S^ <j>E, T i W) 
and U = (OxiO£/ ) $ / ,^g,^)g,r / ,VV / ), there exists a letter transducer ToU — 
(ftx,£V,$". . .W") where Oy C (0 T U{e}) x (n v U{e}) such that Vx e^,te 

W(a: : [T] : t)W'(x : [U] : u) = W(x : [To U] : (t,u)) 
where the term on the right is defined as follows 

W"(x:[ToU]:{t,u))= ^ W"(x : [T o U] : v) 

v£Qy ,Si (v)—t,S2(v)—u 

Here fly is the set of all possible pairwise alignment columns, v G fl' v is a 
pairwise alignment and Si(v) and 5*2 (v) are the sequences in (respectively) the 
first and second rows of v. 

Example construction: Assume without loss of generality that T and U are 
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in strict normal form. Then $" C $ x <!>', <j)'g — (4>s,(j>' s ), (\>" E — {^eA'e) an d 

W'((t,u),u x ,(u y ,u s ),(lf,v!)) = 

5{t = t')5{uj x =u) y = e)W{u, e, lo z , u') if type(u) ^ W 

W(t, e, u x , t')8{ui x =u z = e)5(u = u') if type(u) = W, type(t) ^ W 

W(t,u x ,u) y ,t')W(u,u x ,L) z ,u') if type(i) = type(u) = W 

otherwise 

The resulting transducer is in weak-normal form (it can be converted to a strict- 
normal form transducer by eliminating null states). In the tutorial section, many 
examples of simple and complex intersections are shown in Subsection |2.5[ for 
instance Figure [32] and Figure [35] 

3.6 Identity and bifurcation transducers (I, T) 

Identity: There exists a transducer I — (ft, ft . . .) that copies its input identically 
to its output. An example construction (not in normal form) is 

i = (ft, ft, 0,0,^,1) 

rx = {{(j>,u),u},<f>) : lo e ft} 
Bifurcation: There exists a transducer T = (ft, ft 2 . . .) that duplicates its in- 

/ X\ \ ( X2 \ / X3 

put in parallel. That is, for input x 1X2X3 ... it gives output I ^ J I a; Jix 
An example construction (not in normal form) is 

T = (ft,ft 2 ,{0},0,0,T T ,l) 

r T = I (d),u, ( ^ ^ ,<p\ : uj e ft 

It can be seen that T = I o I. 

An intersection T o U may be considered a parallel composition of T with T 
and U. We write this as T(T, U) or, diagrammatically, 



T U 



We use the notation T(T, U) in several places, when it is convenient to have 
a placeholder transducer T at a bifurcating node in a tree. 

3.7 Exact-match recognizers (V(S')) 

Recognition profiles: A transducer T is a recognizer if it has a null output al- 
phabet, and so generates no output except the empty string. 



Exact-match recognizer: For 5 e fi*, there exists a transducer V(S') = 
(Q, . . . W) that accepts the specific sequence S with weight one, but rejects all 
other input sequences 

W(x : [V(5)] : e) = S(x = S) 

Note that V(S) has a null output alphabet, so its only possible output is the 
empty string, and it is a recognizer. 

In general, if T = (fix,fiy . . . W) is any transducer then Vx £ tt* x , y g £ly 

W'(x : [T] : y) = W(x : [TV(y)] : e) 

An example construction (not in normal form) is 

V(5) - (O,0,Z len(s)+1 ,O,len(5),rv,l) 

r v = |(n,symbol(5,n + l),e, n + 1) : n G Zj en ^J} 

where Zjv is the set of integers modulo N, and symbol^, k) is the fc'th position 
of 5 (for 1 < k < len(S')). Note that this construction has len(S') + 1 states. 
For later convenience it is useful to define the function 

*V(5)(M) = K T w(hj) 
= 5(i + l = j) 

Figure |4] shows a small example of an exact-match transducer for sequence 
LIV, while Figure [10] shows an equivalent exact-match transducer in normal 
form. 

3.8 Generators 

Generative transducers: A transducer T is generative (or "a generator ) if it has 
a null input alphabet, and so rejects any input except the empty string. Then 
T may be regarded as a state machine that generates an output, equivalent to 
a Hidden Markov Model. Define the probability (weight) distribution over the 
output sequence 

P{x\T) = W(e : [T] : x) 

Figure [9] and Figure [T7] are both examples of generative transducers. Fig- 
ure [9] is a specific generator that only emits one sequence (with probability 1), 
while Figure [l7] can potentially emit (and defines a probability for) any output 
sequence. 

3.9 Algorithmic complexities 

|*riH = 0(\$ T \\3>u\) 
\$ ToU \ = 0(|$ T ||$t/|) 
|$ v(s) | = 0(\en(S)) 
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The complexity of computing W(x : [T] : y) is similar to the Forward algo- 
rithm: the time complexity is C(|rr|len(a;)len(?/)) and the memory complexity is 
C(|$t| mm (^ en (x), len(y))). Memory complexity rises to (9(|$y|len(:c)len(y)) if 
a traceback is required. Analogously to the Forward algorithm, there are check- 
pointing versions which trade memory complexity for time complexity. 



3.10 Chapman-Kolmogorov equation 

If T t is a transducer parameterized by a continuous time parameter i, modeling 
the evolution of a sequence for time t under a continuous-time Markov process, 
then the Chapman-Kolmogorov equation |32j can be expressed as a transducer 
equivalence 

TtT u = T t+U 

The TKF91 transducers, for example, have this property. Furthermore, for 
TKF91, T t +u has the same number of states and transitions as T t , so this is a 



kind of self-similarity. TKF91 composed with itself is shown in Figure 24 

In this paper, we have deferred the difficult problem of finding time-parameterized 
transducers that solve this equation (and so may be appropriate for Felsenstein 
recursions). For studies of this problem the reader is referred to previous work 
[2313311171 mm]. 



3.11 Hierarchy of phylogenetic transducers 

3.11.1 Phylogenetic tree (ri, C) 

Suppose we have an evolutionary model defined on a rooted binary phylogenetic 
tree, and a set of k observed sequences associated with the leaf nodes of the tree. 

The nodes are numbered in preorder, with internal nodes (1 . . . k — 1) pre- 
ceding leaf nodes C — {n ... 2k — 1}. Node 1 is the root. 

3.11.2 Hidden and observed sequences (S n ) 

Let S n £ fi* denote the sequence at node n and let T> = {S n , n E £} denote the 
observed leaf-node sequences. 



3.11.3 Model components (B n , R) 

Let B n = (ft, fl, . . .) be a transducer modeling the evolution on the branch to 
node n > 1, from n's parent. Let R = (0, f2, . . .) be a generator modeling the 
distribution of ancestral sequences at the root node. 

Figure [TT] and Figure [19] are examples of B n transducers. Figure [18] is an 
example of an R transducer. 
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3.11.4 The forward model (F n ) 

If n > 1 is a leaf node, define F n = I. Otherwise, let (l,r) denote the left and 
right child nodes, and define 

F n = {BiFi) o (B r F r ) 
which we can represent as (recall that T is the bifurcation 



T 




Fi F r 




transducer) . 

The complete, generative transducer is F a — RFi 

The output alphabet of F is (fi U {e}) K where k is the number of leaf 
sequences. Letting S n : r* — > ft* denote the map from a transition path n 
to the n'th output leaf sequence (with gaps removed), we define the output 
distribution 

P(V\F ) = W{e : [F ] :V)= ^ W(w) 

w.Sn (7r)=(S n VnG«Cn 

where C n denotes the set of leaf nodes that have n as a common ancestor. 

Note that I^Fol — Ilri K ~ 1*^5^1 where 2k — 1 is the number of nodes in the 
tree. So the state space grows exponentially with the size of the tree — and this 
is before we have even introduced any sequences. We seek to avoid this with our 
hierarchy of approximate models, which will have state spaces that are bounded 
in size. 

First, however, we expand the state space even more, by introducing the 
observed sequences explicitly into the model. 

3.11.5 The evidence-expanded model (G„) 

Inference with stochastic grammars often uses a dynamic programming matrix 
(e.g. the Inside matrix) to track the ways that a given evidential sequence can 
be produced by a given grammar. 

For our purposes it is useful to introduce the evidence in a different way, by 
transforming the model to incorporate the evidence directly. We augment the 
state space so that the model is no longer capable of generating any sequences 
except the observed {S n }, by composing Fq with exact-match transducers that 
will only accept the observed sequences. This yields a model whose state space is 
very large and, in fact, is directly analogous to the Inside dynamic programming 
matrix. 
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If n > 1 is a leaf node, define G n = V(«S„). The number of states is 
|$ G J = 0(len(«S„)). 

Otherwise, let (I, r) denote the left and right child nodes, and define 

G n = (Bid) o (B r G r ) 



which we can represent as 



Bi B r 



Gi G r 



Figure [34] and Figure 42 are examples of G„-transducers for the tree of 
Figure [T] 

The complete evidence-expanded model is Go = RG±. (In our tutorial ex- 
ample, the state graph of this transducer has too many transitions to show, but 
it is the configuration shown in Figure [2]) 

The probability that the forward model Fq generates the evidential sequences 
T> is identical to the probability that the evidence-expanded model Go generates 
the empty string 

P(V\F ) = W(e : [F ] : V) = W(e : [G ] : e) 
Note the astronomical number of states in Go 




This is even worse than Fq] in fact, it is the same as the number of cells in the 
Inside matrix for computing P(V\F ). The good news is we are about to start 
constraining it. 



3.11.6 The constrained-expanded model (H n , E n , M n , Q n ) 

We now introduce a progressive series of approximating constraints to make 
inference under the model more tractable. 

If n > 1 is a leaf node, define H n = V(iS„) = G„. The number of states is 
|$ff n | — len(5„), just as with G„. 

Otherwise, let (I, r) denote the left and right child nodes, and define 

H n = (BiEi) o (B r E r ) 

where Q &H„- 
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We can represent H n diagramatically as 



Bi B r 
Ei E r 

Figure [34} Figure [43] and Figure [44] are examples of H n transducers. 

Transducer E n , which is what we mean by the "constrained-expanded model" , 
is effectively a profile of sequences that might plausibly appear at node n, given 
the observed descendants of that node. Figure [38] is an example of such trans- 
ducers for the "intermediate sequence" in the tree of Figure [I] (Figure 37 and 



Figure 40 show the relationship to the corresponding H n transducers). 
The profile is constructed as follows. 

The general idea is to generate a set of candidate sequences at node n, by 
sampling from the posterior distribution of such sequences given only the 
descendants of node n, ignoring (for the moment) the nodes outside the n- 
rooted subtree. To do this, we need to introduce a prior distribution over the 
sequence at node n. This prior is an approximation to replace the true (but as 
yet unknown) posterior distribution due to nodes outside the n-rooted subtree 
(including n's parent, and ancestors all the way back to the root, as well as 
siblings, cousins etc.) 

A plausible choice for this prior, equivalent to assuming stationarity of the 
underlying evolutionary process, is the same prior that we use for the root node; 
that is, the generator model R. We therefore define 

M n = RH n 

= R((BiEi) o (B r E r )) 

We can represent M n diagramatically as R 



Bl By 



Ei E r 



Figure 35 is an example of the M n type of model. 



The transducer Q n = R(Bi o B r ), which forms the comparison kernel of M„, 



G3 



is also useful. It can be represented as R 

T 

Bi B r 



Conceptually, Q n is a generator with two output tapes (i.e. a Pair HMM). 
These tapes are generated by sampling a sequence from the root generator R, 
making two copies of it, and feeding the two copies into B\ and B r respectively. 
The outputs of Bi and B r are the two outputs of the Pair HMM. The different 
states of Q n encode information about how each output symbol originated (e.g. 
by root insertions that were then matched on the branches, vs insertions on the 
branches). Figure 46 shows an example of a Q„-like transducer. 

Transducer M n can be thought of as the dynamic programming matrix that 
we get if we use the Pair HMM Q n to align the recognition profiles Ei and E r . 

3.11.7 Component state tuples (p,v,bi,ei,b r ,e r ) 

Suppose that a G &A,b G &b,v G Our construction of composite trans- 
ducers allows us to represent any state in A o B — T(A, B) as a tuple (v, a, b). 
Similarly, any state in AB can be represented as (a, 6). Each state in M n can 
thus be written as a tuple (p, v, bi,ei, b r , e r ) of component states, where 

• p is the state of the generator transducer R 

• v is the state of the bifurcation transducer T 

• bi is the state of the left-branch transducer B\ 

• ei is the state of the left child profile transducer Ei 

• b r is the state of the right-branch transducer Bi 

• e r is the state of the right child profile transducer E r 

Similarly, each state in H n (and E n ) can be written as a tuple (u, bi, e;, b r , e r ). 

3.11.8 Constructing E n from H„ 

The construction of E n as a sub-model of H n proceeds as follows: 

1. sample a set of K paths from P(vr|M„) = W(n)/W(e : [M n ] : e); 

2. identify the set of M„-states {(/?, v, bi, e;, 6 r , e r )} used by the sampled 
paths; 

3. strip off the leading p's from these M ra -states to find the associated set of 
ff n -states {(v, bi,ei,b r , e r )}; 
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4. the set of -ff„-states so constructed is the subset of En's states that have 
type D (wait states must be added to place it in strict normal form). 

Here K plays the role of a bounding parameter. For the constrained- 
expanded transducer, |$eJ ~ KL, where L = max„ len(<S„). Models H n and 
M n , however, contain 0(b 2 K 2 L 2 ) states, where b = max„|$ B J, as they are 
constructed by intersection of two 0(bKL)-st&te transducers (BiEi and B r E r ). 

3.12 Explicit construction of Q n 

Q n = R(Bi o B r ) 

= (0, (fl U {e}) 2 , $Q„ , S ;Q„ , <f>E;Q n ,T Qn ,W Qn ) 
4>S;Q n = {(t>S;R,4>S;T,(t>S;B l ,4>S:B r ) 
<i>E;Q n = {<i>E;R,4>E;r,<i>E;B n 4>E;B T ) 

3.12.1 States of Q n 

Define type(0i, 2 , <fo . . .) = (type(^i), type(<^ 2 ), type(</> 3 ) . . .). 

Let q = (p,v,bi,b r ) G 3>q„- We construct $q„ from classes, adopting the 
convention that each class of states is defined by its associated types: 

$ class = U : type(p, v, h, b r ) e T class } 
The state typings are 

{(/,M,M,M)} 
{(/, M, M, £>)} 
{(/,M,AM)} 
{(J, M, £>,£>)} 

{(5,5,5,7), (I,M,M,I), (I,M,D,I)} 
{(S,S,I,W), (I,M,I,W)} 
{{W,W,W,W)} 

^left-del U ^right-ins 
^left-ins U ^right-del 

The state space of Q n is 

*Q„ = {0S;Q„, <t>E;Q n } U $ matc h U $i e ft-emit U $ right-cmit U $ null U $ wait 

It is possible to calculate transition/emission weights of Q n by starting with 
the example constructions given for TU and T o U, then eliminating states 
that are not in the above set. This gives the results described in the following 
sections. 



Tmatch — 

^right-del = 

^left-del = 

Tnull = 

^right-ins = 

^left-ins = 

^wait = 

right-emit — 

^left-emit — 
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3.12.2 Emission weights of Q n 

Let (u)i,w r ) € (0 U {e}) 2 . 

The emission weight function for Q n is 



war t (e,(^,c I .),< z ) = <; 



f EC t M. iJ )C t (^ 1 ,Mwf(^ r ,M ifge^match 

£ WT'fe u, i?)W§ mit ( W , Wr) 6,) if g G $ left . del 

£ W^ie^^Wlf^iuj^h) itqe $ right . del 



wI mit (6, Wi ,&o 

W| mlt (e, Wr ,6 r ) 
1 



if 1 € *left-ins 
if Q € $right-ins 
otherwise 



3.12.3 Transition weights of Q„ 

The transition weight between two states q = (p, v, bi, b r ) and q' = (/?', v', fy, b' r ) 
always takes the form 

W^ as (q, q') = W^^in^W^^i^W^^B^W^^B^) 

where Wq^ ans ({ttt}) represents a sum over a set of paths through component 
transducer T. The allowed paths {itt} ar e constrained by the types of q, q' as 
shown in Table [5] Table [5] uses the following conventions: 

• A in any column means that the corresponding state must remain un- 
changed. For example, if \kbi I = then 

W^ ans ({7r s J) = 5{b l =bf l ) 

• A 1 in any column means that the corresponding transducer makes a single 
transition. For example, if |7Tb,| = 1 then 

• A 2 in any column means that the corresponding transducer makes two 
transitions, via an intermediate state. For example, if 17Tb, | = 2 then 

W^ S ({7T B! })= E WB^ibub'DW^WX) 

v l 'e<s>B l 

(Since the transducers are in strict normal form, and given the context 
in which the 2's appear, it will always be the case that the intermediate 
state b'l has type W .) 

• An asterisk (*) in a type-tuple is interpreted as a wildcard; for example, 
(5, S, S, *) corresponds to {(S, S, S, S), (S, S, S, I)}. 

• If a transition does not appear in the above table, or if any of the yy trans 's 
are zero, then p(h, u), uj' , h!) G th„- 
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type(p,u, b u b r ) 


type(p ,u ,oJ,6 r j 


1 1 

Fill 


1 1 

Ft I 


1 1 


1 1 


(S, S, S, *) 


/ Q Q Q T\ 
1 J, J, J, 1 1 

{OCT TT7"\ 

(o, o, i , VVJ 
(J, M, M, M) 


U 

1 


u 


2 


u 

1 

2 


1 
1 

1 

2 




(I, M, M, L>) 


1 


2 


2 


2 




/ 7" 71/1" 7~» 71 /T\ 

(i, M, M) 


i 

1 


2 


2 


2 




(I,M,D,D) 
(W,W,W,W) 


1 
i 

1 


2 
1 


2 
1 


2 
1 


(5,5,7, W) 


^G V T TI/\ 

(D, O, -/ , KK J 

\1,M,M,M) 
(J, M, M, D) 


U 
i 

1 
1 


u 

2 
2 


1 

2 
2 


u 

1 

1 




(I, M, £>, M) 


1 


2 


2 


1 




( T 71/1" 7~1 TW 

(J, M, D, D) 


1 


2 


2 


1 

1 




{~\J7~ TIT" TT7" TJ/^ 

(W,W,W,W) 


1 


1 


1 





(I, M, M, *) 


( T 71 , J" 71 , J" 7"\ 

(J, M, M, J) 











1 




(1,M,1, W) 
(I, M, M, M) 
(I, M, M, D) 


U 
1 
1 


U 

2 
2 


1 

2 
2 


1 

2 
2 




(I, M, D, M) 

( 7" 71 /T 7~"1 7~1 \ 

(J, M,D,D) 


1 
1 


2 
2 


2 

1 


2 
2 




(~i~KT TIT" TT7" TJ/^ 

( lr , H 7 , W , vr j 


1 


1 


1 


1 




/ t 7i /f r~i 7"~\ 
{1,M,D,1) 


U 


u 


u 


1 




(T M T wn 

(J, M, M, M) 
(I, M, M, D) 


n 

u 
1 
1 


n 

u 

2 
2 


1 

± 

2 
2 


i 

_L 

2 
2 




(I, M, L>, M) 


1 


2 


2 


2 




(I,M,D,D) 


1 


2 


2 


2 




(W, W, W, W) 


1 


1 


1 


1 


(J,M,J, 


(I,M,I,W) 
(I, M, M, M) 
(I, M, M, D) 
(I, M, D, M) 




1 
1 
1 




2 
2 
2 


1 

2 
2 
2 



1 
1 
1 




(I,M,D,D) 


1 


2 


2 


1 




(W, W, W, W) 


1 


1 


1 





(W, W, W, W) 


(E, E, E, E) 


1 


1 


1 


1 



Table 5: Transition types of Q n , the transducer described in Subsection |3.12| 
This transducer requires its input to be empty: it is 'generative'. It jointly 
models a parent sequence (hidden) and a pair of sibling sequences (outputs), 
and is somewhat analogous to a Pair HMM. It is used during progressive recon- 
struction. 
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3.12.4 State types of Q n 



type(g) 



S ifq = <f> S ;Q n 

E if q = 4>E;Q n 

W if q e $ wait 

1 if 1 e $ match U $ left-emit U *right-emit 

N if 1 e $ null 



Note that Q n contains null states ($ nu n) corresponding to coincident dele- 
tions on branches n — > Z and n — > r. These states have type(p, u, 6;, 6 r ) = 
(7, M, D, £>). There are transition paths that go through these states, including 
paths that cycle indefinitely among these states. 

We need to eliminate these states before constructing M n . Let Q' n = Q n 
denote the transducer obtained from Q n by marginalizing <& nu u 

®Q' n ={<f>S;Q n , 4>E;Q n ] U $ matc h U ^left-emit U $ right-emit U *wait 

The question arises, how to restore these states when constructing E n l Or- 
theus samples them randomly, but (empirically) a lot of samples are needed 
before there is any chance of guessing the right number, and in practice it 
makes little difference to the accuracy of the reconstruction. In principle it 
might be possible to leave them in as self- looping delete states in E n , but this 
would make E n cyclic. 

3.13 Explicit construction of M n using Q' n , Ei and E r 
Refer to the previous section for definitions pertaining to Q' n . 

M n = R((BiEi) o (B r E r )) 
Q' n = R(Bi o B r ) 



3.13.1 States of M n 

The complete set of M„-states is 

U {(p,v,bi,ei,b r ,e r ) 
U {(p,v,bi,ei,b r ,e r ) 
U {(p,v,bi,ei,b r ,e r ) 
U {(p,v 7 b h ei,b r ,e r ) 



(p, v, b u b r ) E $ mat ch' tv P e ( e i) = type(e r ) = D} 
(p,v,bi,b r ) e ^left-emit! tyP e ( e i) = £>,type(e r ) = 
{p,v,bi,b r ) e ^right-emit' tv Pe(e;) = W r ,type(e r ) 
(p, v, b t ,b r ) e $ wa it, type(e ( ) = type(e r ) = W} 



3.13.2 Emission weights of M n 

Let to = (p,v,bi,ei,b r ,e r ) be an M„-state and q = (p,v,bi,b r ) the subsumed 
Q^-state. 

Similarly, let to' = (//, v' , b[, e\,b' r ,e' r ) and q' = (p 1 , v', b[,b' r ). 
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The emission weight function for M n is 



W e ™ t (e,e,m)={ 



' E E WQ?He, ("I, Wr), q).W%?*(ui,e, eO-Vl^V. e, e r ) if « € $ match 
£ W|? dt (c,(a; I , e ) ) g).>V|? nit (a; J) 6,eO 
£ W§f *(£, (e, uv), 9).Wg mit K, e, e r ) 

w r £fi 



if 9 e $left-emit 
if 9 € $ r ight-emit 



otherwise 



3.13.3 Transitions of M n 

As before, 



q = 

m = 

q' = 

m! = 



(p,v,b t ,b r ) 

(p,v,bi,ei,b r ,e r ) 

(p'V,W) 

[fl ,v! XiXuKX) 



An "upper bound" (i.e. superset) of the transition set of M n is as follows 
tm„ C {(m,e,e,m') : g' € Snatch > tyP^ 1 ?) € {S,I},type(e' h e' r ) = (D,D)} 



U {(m, e, e, m') 
U {(m, e, e, m') 
U {(m, e, e, m') 
U {(m, e, e, m') 



-emit i 

type(g) € {S, J}, type(ej, e' r ) = (£>, W)} 
9 G $ right-emit ' 

type(g) G {5, /}, type(e{, e' r ) = (W,D)} 
q' G *wait.tyPe(9) G {5,7},type(e^ e ;) = (W, W)} 
type(g, ej) e r ) - (W,W,W),typetfXX) = (E,E,E)} 



More precisely, tm„ contains the transitions in the above set for which the 
transition weight (defined in the next section) is nonzero. (This ensures that 
the individual transition paths q — >• q' , e; — >• e\ and e r ~ > e' r exist with nonzero 
weight.) 



3.13.4 Transition weights of M n 

Let Wj^ a " walt (e, e') be the weight of either the direct transition e — > e', or a 
double transition e — >• e" — > e' summed over all intermediate states e" 



W 



via-wait 



(e, e') = < e "e* 



2 W^ anS (e,e'')W|f ns (e'',e') if type(e) G {S, D} 



w 



trans 



(e,e>) 



if type(e) = W 



Let W^? walt (e, e') be the weight of a transition (or non-transition) that 
leaves E n in a wait state 



W|f ns (e, e') if type(e) e {5, 1?}, typc(e') - W 
1 if e = e', t 

otherwise 



W to-wait (e5 e ') = <; ! if e = e / ; type(e /) = w 
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The transition weight function for M n is 

if <t G $ match 
if q' e $ieft-emit 
if <?' S $ r ight-emit 

if 4 G *wait 
if q' = B;Q ; 

3.14 Explicit construction of i7 n 

This construction is somewhat redundant, since we construct M n from Q n , Ei 
and E r , rather than from RH n . It is retained for comparison. 

H n = o (B r E r ) 

= {£l,<b,$H n ,<t>S;H n ,<f>E;H n ,TH n ,W Hn ) 

Assume Bi,B r ,Ei, E r in strict-normal form. 
3.14.1 States of H n 

Define type(0i, 2 , <j>3 ■ ■ ■) = (type(^i),type(<^ 2 ),type(^ 3 ) . . .). 

Let h — (v,bi,ei,b r ,e r ) e 3>£f„- We construct from classes, adopting 
the convention that each class of states is defined by its associated types: 

$ class = i h : type(v,bi,ei,b r ,e r ) e T class } 

Define $ ex t C to be the subset of i?„-states that follow externally- 

driven cascades 

T ext = {(M,M,D,M,D), (M,M,D,D,W), 
(M, D, W, M, D), (M,D,W,D,W)} 

Define $i nt C to be the subset of if„-states that follow internal cascades 

Mnt — ^left-int ^ ^right-int 
Tleft-int = {(S,I,D,W,W), (M,I,D,W,W)} 
7" r ight-int = {(S,S,S,I,D), (M,M,D,I,D), (M,D,W,I,D)} 

Remaining states are the start, end, and wait states: 

4>S;H n = {4>S;r,<t>S;B,,4>S;E,,4>S;B r ,(t>S;E r ) 
4>E;H n = {<t>E;T,<t>E;B,,<t>E;E l ,(t > E;B T ,<f>E;E r ) 

Twait = {(W,W,W,W,W)} 
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W™(m, m') - W£, ans ( 9 , q') X { 



' w via-wait (e;j e / )w via-wait (er; e g 
w via-wait (e;; e , )w to-wait (er; g ^ 
W to-wait (g;; e , )w vm r -wait (er; e ^ 
w to-wait (e;j e;)w to-wait (er; e , ) 

. WK anS (e ; ,e;)W|. ans (e r ,e;) 



The complete set of H„-states is 

} U $ext U $ int U $ wait 

It is possible to calculate transition/emission weights of H n by starting with 
the example constructions given for TU and T o U, then eliminating states 
that are not in the above set. This gives the results described in the following 
sections. 

3.14.2 Emission weights of H„ 

Let uj,oj' 6(UU {e}). 

Let C n (b n , e„) be the emission weight function for B n E n on a transition into 
composite state (6„,e„) where type(&„,e„) = (I,D) 

C n (b n ,e n ) = W% r f t (e,u>,b n )W%^ it (iJ,e,e n ) 

Let D n {oj, b n , e n ) be the emission weight function for B n E n on a transition 
into composite state (b n , e n ) where type(6 n , e„) e {(M, D), (D, W)} with input 
symbol ui 

\ E Wl™V,^MW^V^e n ) iftype(6 n ,e n ) = (M,£>) 
D n (u, b n , e„ ) = < u'en 

[ WgfHu, K) if type(b„, e„) = (D, W) 

The emission weight function for H n is 

!Di(uj,bi,ei)D r (uj,b r ,e r ) if h e $ e xt 

Q(bi,ei) if ft e $i eft _ int 

a(6,,e r ) if^$ right _ int 

1 otherwise 

3.14.3 Transition weights of _ff„ 

The transition weight between two states h — (v,bi,ei,b r ,e r ) and h! = (v' , b[, e[, b' r , e' r ) 
always takes the form 

W%™ s (h, h!) = W^ s ({7r r }).W*™ ns ({7^ 

where the RHS terms again represent sums over paths, with the allowed paths 
depending on the types of h, hi as shown in Table [6j Table [6] uses the same 
conventions as Table 03 



3.14.4 State types of H n 



type(/i) = < 



S if h = 4>s-,H n 

E if h = 4>E-,H n 

W if^G* wait 

D if he $ext 

[ N if he $ int 
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type(t), bi,ei,b r ,e r ) 


type(u ,0|,e i3 o r ,e r ; 




1^ 1 


1^ 1 


FBrl 




( Q Q Q Q Q\ 
I J, O, O, O 1 


(b,b,b,I,D) 


(J 





u 


1 


2 




{ C T T^i JUT TXT\ 

{S,I,D, W,W) 





1 


2 


1 


1 




(JXT TT/ TT/ TT/" TJ/'N 

(W, vV, W,W,W) 


1 

1 


1 


1 


1 

1 


1 


/COOT 7~~l\ 

(6, 6, 6, J, D) 







(J 


u 


1 


2 




/ O 7" 7~* T T7" T J 7" \ 

(5,1, D, W, W) 





1 


2 


1 


1 




fJX/ TT/ TT/ TT/ TT/\ 

W, kk, kv, It J 


1 
1 


1 


-i 
1 


1 

1 


1 


/ C T T~\ TT7" TT7"\ 

{b,I,D,W,W) 


/ C 1 T 7~» TT/" TT7"\ 


(J 


1 


2 


(J 







/ti/ TT7" TIT" TT7" TI7"\ 

(Vv, Vy, vv, Vv, It) 


1 


1 


1 








/TJ/~ TT/" TT/ TT/" TT/\ 

(14 7 , It, W,W,W) 


(M, iW, i^, M, JJ) 


-1 

1 


1 


1 


-1 

1 


I 




l 71 /f 71 7~» 7~» TIM 

(M,M, D,D, W ) 


1 


1 


1 


1 







I 71 /T 7~* TT7" 71 if 7~l\ 

(M, L>, IV, M, L>) 


1 


1 





1 


1 




/ TV /-r n TT/ D \XT\ 
(^M , /v. vt, iJ. W ) 


1 
1 


1 
1 


U 


1 


u 




/ 77 77 77 77 77\ 

^ii/, xv, £/ J 


1 

1 


-1 
1 


1 


1 

1 


i 

i 


(M, M, D, M, D) 


/ 71 7" 71 /r 7~* r 7~l\ 

(M, M, D,J,D) 











1 


2 




(M, J , JJ, W, W ) 


u 


1 


o 
z 


1 

1 


1 




(~\~AT TUT TJ/ TJ^ T/f/*l 
KK, W,W,W) 


1 

1 


1 

1 


1 


1 


1 


/ 71 /T 71 /f T\ 7~"i TI 7"\ 

(M, M, D, D, W) 


(M, M, D,J,D) 











1 


1 




(iW, 1 , JJ, W, W ) 


u 


1 


o 
L 


1 

1 


U 




{~\~\T TAT TTiT TAl AXr\ 

(W, W, W,W,W) 


1 

1 


1 


1 


1 


U 


(M D W M D\ 


( m n w t n\ 


n 

u 


n 

u 


n 
u 


1 

L 



z. 




(M, I,D,W, W) 





1 


1 


1 


1 




(W, W, W, W, W) 


1 


1 





1 


1 


(M, £>, W, £>, W) 


(M, D, W, I, D) 











1 


1 




(M,I,D,W,W) 





1 


1 


1 







(W, W, W, W, W) 


1 


1 





1 





(M,M,D,I,D) 


(M, M,D,I,D) 











1 


2 




(M, I,D,W, W) 





1 


2 


1 


1 




(W, W, W, W, W) 


1 


1 


1 


1 


1 


(M,D,W,I,D) 


(M, D, W, I, D) 











1 


2 




(M,I,D,W,W) 





1 


1 


1 


1 




(W, W, W, W, W) 


1 


1 





1 


1 


(M,I,D,W,W) 


(M, I, D, W, W) 





1 


2 










(W, W, W, W, W) 


1 


1 


1 









Table 6: Transition types of H n , the transducer described in Subsection |3.14| 
This transducer requires non-empty input: it is a 'recognizing profile' or 'rec- 
ognizer'. It models a subtree of sequences conditional on an absorbed parental 
sequence. It is used during progressive reconstruction. 
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Since H n contains states of type N (the internal cascades), it is necessary 
to eliminate these states from E n (after sampling paths through M n ), so as to 
guarantee that E n will be in strict normal form. 

3.15 Explicit construction of M n using R and H n 

This construction is somewhat redundant, since we construct M n from Q n , E\ 
and E r , rather than from RH n . It is retained for comparison. 

The following construction uses the fact that M n = RH n so that we can 
compactly define M n by referring back to the previous construction of H n . In 
practice, it will be more efficient to prccomputc Q n = R(Bi o B r ). 

Refer to the previous section ("Explicit construction of H n v ) for definitions 
of $ cxt , $ int , $ wait , W|f ns (/i, h'),W%f(«>, «/, h). 

Assume that R is in strict normal form. 



M n = RH n 

= R({B l Ei)o{B r E r )) 

= (0,0, ^M n ,(t>S;M n ,4>E;M n ,TM n ,yVM n ) 

PS;Mn = (4'S-M,(t , S:r,4 , S;B l ,(t>S-,E l ,4>S:B r ,4>S;E r ) 

^E;M n — {<t > E;R,<t ) E;r,<t > E;B l ,4>E;Ei,<j>E;B r ,<l>E;E r ) 



3.15.1 States of M n 

The complete set of M„-states is 

U {(p,v 7 b h ei,b r ,e r ) 
U {(p,v,b h ei,b r ,e r ) 
U {(p, v,b h e h b r ,e r ) 
U {(p,v,b h ei,b r ,e r ) 



{v, bi,ei,b r , e r ) e $ e xt, type(p) = 1} 
(v,b h ei,b r ,e r ) <= $ wa it> type(p) = W} 
(v, b h e u b r , e r ) £ $ int , type(p) = type(u) = 
(v,b h ei,b r ,e r ) € $ int , type(p) = I, type(u) 



3.15.2 Emission weights of M n 

Let to = (p, v, bi,ei,b r ,e r ) be an M„-state and h = (v, bi,ei,b r ,e r ) the subsumed 
i?„-state. 

Similarly, let to' = (p', w', b[, e' l7 b' r ,e' r ) and /i' = (v' , b[, e[, b' r , e' r ). 
The emission weight function for M n is 

^W« mit (e )W ,p)W^ lit (a;,e,/ l ) if ft e $ 6 xt 

W|r mlt (e,e,/i) otherwise 
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3.15.3 Transition weights of M„ 

The transition weight function for M n is 



wX ans (m,m') 



Wfl anS (p,p') V W^iKh'^W^ih^h') if^'G^ext 



' l " e *wait 
;trans/„ „n-i/\;trans 



Wfl ans (p, p')Wh* us (K h') otherwise 



If Ei and E r are acyclic, then H n and E n will be acyclic too. However, M n 
does contain cycles among states of type (I, M, D, W, D, W). These correspond 
to characters output by R that are then deleted by both Bi and B r . It is 
necessary to eliminate these states from M n by marginalization, and to then 
restore them probabilistically when sampling paths through M n . 

3.16 Dynamic programming algorithms 

The recursion for W(e : [M n ] : e) is 

W(e:[M n ]:e) = Z(<f> E ) 

Z[m!) = Z(m)>V(m, e, e, m!) Vra' 7^ 05 

m:(m,e,e,m')£T 

z{4> s ) = 1 

The algorithm to fill Z(m) has the general structure shown in Algorithm [lj 
(Some optimization of this algorithm is desirable, since not all tuples (p, v, 6j, ej, b r , e r 
are states of M n . If E n is in strict-normal form its W- and -D-states will occur in 
pairs (c.f. the strict-normal version of the exact-match transducer V(S')). These 
(D,W) pairs are largely redundant: the choice between D and W is dictated 
by the parent B n , as can be seen from Tableland the construction of 



Initialize Z((f>s) <— 1; 










foreach e; <G do 




/* topologically- 


-sorted 


*/ 


foreach e r g $£ r do 




/* topologically- 


-sorted 


*/ 


foreach (p,v,bi,b r ) € <&q 


, do 


/* topologically- 


sorted 


*/ 


Let to = (p,v,b h e h b r , 


6 r ), 








if to g $m„ then 










Compute Z(m)\ 










Return Z{$e)- 











Algorithm 1: The analog of the Forward algorithm for transducer M, 



described in Subsection |3.13| This is used during progressive reconstruc- 
tion to store the sum-over-paths likelihood up to each state in &m„ ■ The 
value of Z{(j) E .) is the likelihood of sequences descended from node n. 
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For comparison, the Forward algorithm for computing the probability of two 
sequences (Si, S r ) being generated by a Pair Hidden Markov Model (M) has the 
general structure shown in Algorithm [2j 



Initialize cell (0,0, START); 






foreach < ii < len(Si) do 


/* ascending order 


*/ 


foreach < i r < len(S r ) do 


/* ascending order 


*/ 


foreach a E M do 


/* topologically-sorted 


*/ 


Compute the sum-over-paths up to cell (ii,i r ,a); 




Return cell (\en(S t ), \en(S r ), END). 







Algorithm 2: The general form of the Forward algorithm for computing 



the joint probability of two sequences generated by the model M, a Pair 
HMM. 

The generative transducer Q n = R(Bi o B r ) in Algorithm [T] is effectively 
identical to the Pair HMM in Algorithm [2j 

3.17 Pseudocode for DP recursion 

We outline a more precise version of the Forward-like DP recursion in Algo- 
rithm [3] and the associated Function sumjpathsjto. Let get_state_type(g, side) 
return the state type for the profile on side which is consistent with q. 

3.17.1 Transition sets 

Since all transitions in the state spaces Q' n , Ei, and E r are known, we can define 
the following sets : 

incoming_left_profile_indices(j) = {i : ti(i,j) =/= 0} 

incoming_right_profile_indiccs(j) = {i : t r (i,j) =/= 0} 

incoming_match_states(g') = {q : q e $ matc h> W%? ns (q, q') ^ 0} 

incoming_left_emit_states(g') = {q : q E $i c ft_ em it' ^Q'^iQi Q') ^ °} 

incoming_right_emit_states(g'') = {q : q E $ r ight-emit ' WQ/ ans (q, 17') 7^ 0} 



3.17.2 Traceback 

Sampling a path from P(ir\M n ) is analogous to stochastic traceback through the 
Forward matrix. The basic traceback algorithm is presented in Algorithm[5j and 
a more precise version is presented in Algorithm [6j 

Let the function sample(set, weights) input two equal-length vectors and 
return a randomly-chosen element of set, such that seti is sampled with proba- 
bility — weightsi ^ s {. a .j- e p^j^ w jth two sampled paths is shown in Figure 

J sum(weignts) 1 sr sr o 
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Initialize Z{4>s) 1; 

foreach 1 < i' r < N r do 

foreach q' e {q : type(q) = (S, S, S, I)} do 

Let {e' l ,e' r ) = {<t>s,<t>f ) ); 
sumjpaths jto{q' , e'i,e' r ); 
foreach 1 < i', < Ni do 
foreach 1 < i' r < N r do 

if is-iri-envelo'pe(j!i 1 i! r ) then 
foreach q 1 € <$> match do 

Let (e;,e;) = (^ ) ,^ ) ); 
sumjpaths jto(q' , ej, e' r ); 
foreach q' € $i e f t _ emit do 

Let (e;,e;) = (^ ;) ,^ ) ); 
sumjpaths Jo(q' , ej, e^.); 
foreach <?' € ^right-emit do 

if type(g') == (5, S, S, J) then 

continue 
Let t = get_state_type(g', left); 

(e' l ,e' r ) = (4 l ''\^ ) ); 
sumjpaths jto{q' , e'^e' r ); 
foreach q' € <5> wait do 

Let {e' l ,e' r ) = {4> ( § ) A { § ) )- 1 
sumjpaths -to(q' , ej, e' r ); 
foreach 1 < i[ < Ni do 

foreach q> e § left- emit do 
Let ( e ;,<) = (<^\<^ d >); 
sumjpaths -to{q' , ej, e^.); 
foreach 1 < i' r < N r do 

foreach g' e $ right-emit do 
Let ( e j, e ;) = (^ nd) ,^ ) ); 
sumjpaths jto(q' , ej, ej,); 
foreach g' e $ warf do 

Let ( e ;, e ;) = (^ nd) ,C d) ); 

sumjpaths -to(q' ', ej, e^.); 
Initialize Z((/)e) 0; 
foreach g e $ walt do 

Let m = (g,<^ nd) ,<^ nd) ); 

«- + Z(m)W^ ans (g,^ ;Q J; 



Algorithm 3: The full version of the analog of the Forward algorithm for 
transducer M n , described in Subsection |3.13| This to visit each state in 
$m„ in the proper order, storing the sum-over-paths likelihood up to that 
state using sum_paths_to(. . . ) (defined separately). The value of Z{<f>E-) is 
the likelihood of sequences descended from node n. 
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Input: (q' ,e[,e' r ). 

Result: The cell in Z for to — (q' , e[, e' r ) is filled. 

Let to' = (q',e' l ,e' r ); 
Let S = W^ it {e,e,m'); 

Initialize Z{rri) 4- W^ ans (>s, q%(0, ij)^(0,^)f ; 
foreach € incoming Jeft-profile-indicesii't) do 

foreach i r € incoming_right_profile-indices(i' r ) do 
Let ( e; ,e r ) = (^ ) ,^" ) ); 
foreach g G incoming -matchstates{q') do 
Let to = (g, e;,e r ); 

Z(m') <- Z(ro') + Z(TO)W^ ans (g j(Z ')^^,^)^^^;)f 
foreach ii € incoming_lcft_profile_indiccs(i'i} do 
foreach g G incoming _left_emit_states(q') do 

Let (e / ,e r ) = (^ i) ,^ ) ); 
Let to = (g, ej, e r ); 

z(to') <- z(m') + z(m)w^ s ( g ,</)**M)£; 

foreach i r G incoming_right_profile_indices{i' r ) do 
foreach g G incoming _right_emit_states(q / ) do 
Let r = get_state_type(g, left); 

Let ( ei ,e r ) = (# ) ,^ ) ); 
Let to = (<?, e;, e r ); 

Z(to') <- Z(m') + Z(m)W%? ns (q,q J )t r (i h i' l )£; 

Function sum_paths_to used by Algorithm |3j This is used during the 
Forward algorithm to compute the sum-over-paths likelihood ending at 
a given state. This quantity is later used to guide stochastic sampling 
(Algorithms [5] and |6| . 
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Input: Z 

Output: A path it through M n sampled proportional to Wj^ ans (7r). 

Initialize n -s— (4>E-.M n ) 
Initialize m! <— 4>E;M n 

while m! ^ <f>s-,M„ do 

Let K = {k e M n ■ W^ ans (fc, m') ^ 0} 

With probability — — ^p^ , P re P end - to - 

Set m! ■<— m 
return ir 

Algorithm 5: Pseudocode for Stochastic traccback for sampling paths 



through the transducer M n , described in Subsection 3.13 Stochastic sam- 
pling is done such that a path n through M n is visited proportional to its 
likelihood weight. By tracing a series of paths through M n and storing the 
union of these paths as a sequence profile, we are able to limit the number 
of solutions considered during progressive reconstruction, reducing time 
and memory complexity. 



3.18 Alignment envelopes 

Note that states e e of the constrained-expanded model, as with states 
9 € 3>g„ °f the expanded model, can be associated with a vector of subsequence 
co-ordinates (one subsequence co-ordinate for each leaf-sequence in the clade 
descended from node n). For example, in our non-normal construction of V(5), 
the state <f> € Zigi+i is itself the co-ordinate. The co-ordinate information 
associated with e; and e r can, therefore, be used to define some sort of alignment 
envelope, as in [34]. For example, we could exclude (e;,e r ) pairs if they result 
in alignment cutpoints that are too far from the main diagonal of the sequence 
co-ordinate hypercube. 

Let the function is_in_envelope( ii , i r ) return true if the state-index pair (j; , i r ) 
is allowed by the alignment envelope, and false otherwise. Further, assume that 
Z is a sparse container data structure, such that Z(m) always evaluates to zero 
if m = (p, v, bi,ei, b r , e r ) is not in the envelope. 

3.18.1 Construction of alignment envelopes 

Let V(5) be defined such that it has only one nonzero- weighted path 

„ T symbol(s,i) , , „ r symbol(s.2) , , TrT symbol(S,len(s)) , , 

A -> Wo 3 -> Mi -> Wi -> Mi -> . . . -> W L ^ y V M len(s) -> Wi en(s) -> A len(s 

so a V(5)-state is either the start state (A ), the end state (Aj^^), a wait 
state (Wi) or a match state (Mi). All these states have the form fa where i 
represents the number of symbols of S that have to be read in order to reach 
that state, i.e. a "co-ordinate" into S. All V(5)-states are labeled with such 
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Input: Z 

Output: A path it through M n sampled proportional to Wj^ ans (7r). 

Initialize n -s— (4>E-.M n ) 
Initialize m! <— 4>E;M n 

while to' ^ 4>s-,M„ do 
Initialize states_in <— () 
Initialize weights_in () 
Set (g J ,e' l ,e' r ) <- m! 
if to' = (f>E;M n then 

foreach q G $ wait do 

Add (</, 4>^^\ </>(^ nC ^) to states_in 

Add Z((g>^ nd \^ nd) ))W^ ans ( g ,0 B;Q J to weightsin 

else 

if W^ rw (^ i Q/ B ,gTo)t,(0,i{)tr(0,i / r ) ^ then 

Add {4>s;Q' n , <t>ST 4>s) to statesdn 

Add W^ ans (<?!)s;Q; i ,«ro)t ; (0,^)ir(0,i;) to weightsjn 
foreach i; 6 incoming_left_profile-indices{i' l ) do 

foreach j r G incoming_right_profile-indices(i' r ) do 

Let (e J ,c r ) = (^ l) ) ^ ) ); 
foreach g G incoming _match_states(q') do 
Add (q,ei,e r ) to states_in; 

Add Z(( g ,e i ,e I .))W^ ans ( <Z ,(?')^(^^;)^(v,«;) to 
weights_in; 

foreach i; G incomingJeft-profile-indices(ii) do 

Let (e i) e r ) = (^ !) ,^ ) ); 
foreach 5 g incoming _left_emitstates{q') do 
Add (9, e;,e r ) to states_in; 

Add Z((g,e h e r ))W£^(g,gO*;(Mi) to weightsin; 
foreach i r G incoming_right_profi,le-indices{i' r ) do 
Let r = get_state_type(y , left); 

Let(e i ,e r ) = (# ) ,^ ) ); 

foreach 5 G incoming _right_emit_states{q') do 
Add (q,ei,e r ) to states_in; 

Add Z((g,e;,e r ))W^ r ^(g,gOMw'r) to weightsjn; 

Set to «— sample(states_in, weights_in) 

Prepend to to 7r 

Set m' <— to 
return ir 

Algorithm 6: Pseudocode for stochastic traceback for sampling paths 



through the transducer M n , described in Subsection 3.13 Stochastic sam- 
pling is done such that a path tt through M n is visited proportional to its 
likelihood weight. By tracing a series of paths through M n and storing the 
union of these paths as a sequence profile, we are able to limit the number 
of solutions considered during progressive reconstruction, reducing time 
and memory complexity. 
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co-ordinates, as are the states of any transducer that is a composition involving 
V(5), such as G n or H n . 

For example, in a simple case involving a root node (1) with two children 
(2,3) whose sequences are constrained to be 82,83, the evidence transducer is 

G = RG mot = R(G 2 o G 3 ) = R(T(B 2 W(S 2 ), £ 3 V(S 3 ))) = R 

V[S 2 ] V[5 3 ] 

All states of G have the form g — (r, b 2 , 02*2, &3, ^3*3) where <f>2,4>3 € 
{X, W, M}, so 4> 2 i 2 € {X i2 ,Wi 2 , M i2 } and similarly for ^313. Thus, each state 
in G is associated with a co-ordinate pair (12,13) into (62, S3), as well as a 
state-type pair (fai,^)- 

Let n be a node in the tree, let C n be the set of indices of leaf nodes descended 
from n, and let G n be the phylogenetic transducer for the subtree rooted at n, 
defined in Subsection |3.11.5| Let <& n be the state space of G n . 

If to € C n is a leaf node descended from n, then G n includes, as a component, 
the transducer V(5 m ). Any G„-state, g £ $„, is a tuple, one element of which is 
a V(5 m )-state, fa, where i is a co-ordinate (into sequence S m ) and is a state- 
type. Define i m {g) to be the co-ordinate and 4> m {g) to be the corresponding 
state-type. 

Let A n : $„ — > 2 Cn be the function returning the set of absorbing leaf indices 
for a state, such that the existence of a finite-weight transition g' — > g implies 
that i m (g) = i m (g') + 1 for all to G A n (g). 

Let (Z,r) be two sibling nodes. The alignment envelope is the set of sibling 
state-pairs from Gi and G r that can be aligned. The function E: $/ x <& r — > 
{0, 1} indicates membership of the envelope. For example, this basic envelope 
allows only sibling co-ordinates separated by a distance s or less 

^basic (/> 9) = max \i m (/) - i„ (g) | < s 

meA i (/),n£A r (g) 

An alignment envelope can be based on a guide alignment. For leaf nodes 
x, y and 1 < i < len^^,), let ^(a;, i, j/) be the number of residues of sequence S'y 
in the section of the guide alignment from the first column, up to and including 
the column containing residue i of sequence S x - 

This envelope excludes a pair of sibling states if they include a homology 
between residues which is more than s from the homology of those characters 
contained in the guide alignment: 

£ 'guide(/'5') = „ , max „ , , max ( \G(m,i m (f),n)-i n (g)\ ,\g(n,i n (g),m)-i m (f)\ ) < s 

Let K(x,i,y, j) be the number of match columns (those alignment columns 
in which both S x and S y have a non-gap character) between the column con- 
taining residue i of sequence S x and the column containing residue j of sequence 
S y . This envelope excludes a pair of sibling states if they include a homology 
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between residues which is more than s matches from the homology of those 
characters contained in the guide alignment: 



£; euidc(/'-9) = , max , max ( \0(m,i m (f), n) - K(m, i m (f), n, i n (g))\, 

\G{n,i n (g),m) - K(n,i n (g),m,i m (f))\ ) < s 

3.19 Explicit construction of profile E n from M n , following 
DP 

Having sampled a set of paths through M n , profile E n is constructed by applying 
a series of transformations. 

Refer to the previous sections for definitions pertaining to M n and E n . 

3.19.1 Transformation M n — > M' n : sampled paths 

Let II' C IIm' be a set of complete paths through M n , corresponding to K 
random samples from P(jr\M n ). 

The state space of M' n is the union of all states used by the paths in II M ^ 

M 

Tren' i=i 

where 7Tj is the i th state in a path n <G (tmJ*. 

The emission and transition weight functions for M' n are the same as those 
ofM„: 

W^ ns (m,m') = W^ ns (m,m') 
W^Hccm') = W£f(e,e,m') 

3.19.2 Transformation M' n — > E": stripping out the prior 

Let E'n be a the transducer constructed via removing the R states from the 
states of M' n . 

= {{v,b h eub r ,e r ) : 3(p,v 7 bi,ei,b r ,e r ) € M' n } 

The emission and transition weight functions for E^ are the same as those 
oiH n : 

W^f ns (e,e') = w£f is (e,e') 
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3.19.3 Transformation E'^ — > E' n : eliminating null states 

Let E' n be the transducer derived from E^ by marginalizing its null states. 
The state space of E' n is the set of non-null states in E'^: 

= {e : e G E%, type(e) € {S, E, D, W}} 

The transition weight function is the same as that of H n (and also E") with 
paths through null states marginalized. Let 



7r.E;'(e,e') = {n ■ = e, 7T|^-| = e' , typefa) = N V i : 2 < i < \ir\} 
W^ ans (e,e') = Yl W^ ns (Tr) 

The transition weight function resulting from summing over null states in 
can be done with a preorder traversal of the state graph, outlined in Algorithm]?} 
The stack data structure has operations stack. push(e), which adds e to the top 
of the stack, and stack.popQ, which removes and returns the top element of the 
stack. The weights container maps states in to real- valued weights. 

Besides the states {</>£■, <^ nC ^}, the remaining states in E' n are of type 
D, which we now index in ascending topological order: 

*K = 4>E^T A) } U {4$ ■■l<n<N n ] 
where 1 < i, j < N n , 

tn(i,j) 
*n(0,j) 
t n (i,N n + l) 



ee W^ ans (^,^ nd) ) 



3.19.4 Transformation E' n — > adding wait states 

Let E n be the transducer derived from transforming E' n into strict normal form. 
Since N states were removed in the transformation from E" — > E' n , we need 
only add W states before each D state: 

$E n = Ws, <^E,4> { w A) } U {4$ :l<n<N}U {<f$ : 1 < n < N} 

Note the following correspondences between the previously- defined yV^ rans (e, e') 
and new notation. 
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Input: $ B » , W^ ns 
Output: Wj£ ans " 

Let = {e: e € E%,type(e) € {S, E, D, W}} 
Initialize inew(e, e') = V(e, e') 6 3>E' 
foreach sourcestate £ $E'^ do 

Initialize stack — [source state] 

Initialize weight s[s our ce state] = 

while stacfc 7^ [] do 
Set e = stack. popQ 

^e>[ ■■ y^E- 

if type(e') ^ N then 

inew(sowrce, e')+ = weights[e] ■ Wffi ns {e, e') 

else 

stack. push(e') 

weights[e'] = weight[e] ■ y\4 r ,? ns (e, e') 

return Wj^fe, e') = i neW (e, e') V(e, e') £ $ B; 

Algorithm 7: Pseudocode for transforming the transition weight function 
of E" into that of E' n via summing over null state paths (insertions). This 
is done after stochastic sampling as the first of two steps transforming a 



foreach e' e {e' e $ B » : W^ ns (e, e') ^ 0} do 



sampled M n transducer into a recognizer E n , described in Subsection 3.19 
Summing over null states ensures that these states cannot align to sibling 
states in the parent round of profile-profile alignment. An insertion at 
branch n is, by definition, not homologous to any characters outside the n- 
rooted subtree, and null state elimination is how this is explicitly enforced 
in our algorithm. 



83 



w 



trans ( 



yytrans^ 
w via-wait^ 

yyto-Wait( 

to- wait 



w 



En 



tn(i,j) 
1 

t n (i,j) 
S(i = j) 



3.20 Message-passing interpretation 

In the interpretation of Felsenstein's pruning algorithm 1 and Elston and Stew- 
art's more general peeling algorithm 35J as message-passing on factor graphs 
[12], the tip-to-root messages are functions of the form P(T> n \S n — x) where S n 
(a random variable) denotes the sequence at node n, x denotes a particular value 
for this r.v., and T> n = {S m : m g C n } denotes the observation of sequences at 
nodes in £„, the set of leaf nodes that have n as a common ancestor. 

These tip-to-root messages are equivalent to our evidence-expanded trans- 



ducers (Subsection 3.11.5): 



P(V n \S n =x)=W(x: [G n ] :e) 

The corresponding root-to-tip messages take the form P(T> n ,S n = x) where 
T) n = {S m : m € C,m ^ £„} denotes the observation of sequences at leaf nodes 
that do not have n as a common ancestor. These messages can be combined with 
the tip-to-root messages to yield posterior probabilities of ancestral sequences 

p ,~ _ ^_ PCDn,S n = x)P(p n \S n = x) 
r{i> n -X\U) - P~(p) 

We can define a recursion for transducers that model these root-to-tip mes- 
sages, just as with the tip-to-root messages. 
First, define J\= R. 

Next, suppose that n > 1 is a node with parent p and sibling s. Define 



J n — J P {B n o (B S G S )) = J p 



J- j) 




T 

A 




l" 


\ 
B s 




G s 



Note that J n is a generator that outputs only the sequence at node n (because 
G s has null output). Note also that J n G n = Gq. 
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The root-to-tip message is 

P(V n ,S n =x)=W(e: [J n ] : x) 

The equations for G n and J n are transducer formulations of the pruning and 
peeling recursions 

P{V n \S n =x) = p ( s i = y\ s n = x)P(Vi\Si = y)j p ( s r = z\S n = x)P{V r \S r = z) 

P(V n ,S n =x) = J2P(D p ,S p = y)P(S n = x\S p =y)J2P(S s = z\S p = y)P(V s \S s = z) 

y z 

where (l,r) are the left and right children of node n. For comparison, 

G n = (Bid) ° (B r G r ) 
J n = J P {B n o (B S G S )) 
W(x : [B n ] : y) = P{S n = y\S p = x) 

W(x : [TU] : z) = ^ W ( x : I T 1 : v) W (v : \ U ] : z ) 

y 

W{x : [Tj] : Vt)W(x : [T P ] : P r ) = W(s : [T, o T r ] : 2? n ) 

4 Conclusions 

In this article we have presented an algorithm that may be viewed in two equiva- 
lent ways: a form of Felsenstein's pruning algorithm generalized from individual 
characters to entire sequences, or a phylogenetic generalization of progressive 
alignment. Our algorithm extends the concept of a character substitution ma- 
trix (e.g. [361 137j) to finite-state transducers, replacing matrix multiplication 
with transducer composition. 

We described a hierarchical approximation technique enabling inference in 
0(L 2 N) time and memory, as opposed to 0(L N ) for exact, exhaustive inference 
(N sequences of length L). Adding additional constraints (in the form of a 
"guide alignment") bring time and memory complexity down to O(LN), making 
the algorithm practical for typical alignment problems. 

Much of computational biology depends on a multiple sequence alignment 
as input, yet the uncertainty and bias engendered by an alignment is rarely 
accounted for. Further, most alignment programs account for the phylogeny 
relating sequences either in a heuristic sense, or not at all. Previous stud- 
ies have indicated that alignment uncertainty and/or accuracy strongly affects 
downstream analyses [38] , particularly if evolutionary inferences are to be made 

In this work we have described the mathematics and algorithms required 
for an alignment algorithm that is simultaneously explicitly phylogenetic and 
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avoids conditioning on a single multiple alignment. In extensive simulations (in 
separate work, submitted), we find that our implementation of this algorithm 
recovers significantly more accurate reconstructions of simulated indcl histories, 
indicating the need for mathematically rigorous alignment algorithms, partic- 
ularly for evolutionary applications. A subset of these results are described at 
http : //www . biowiki . org/ProtPal 

The source code to this paper, including the graphviz and phylocomposer 
files used to produce the diagrams, can be found at https : / /github . com/inn/ 
transducer-tutorial 

Authors' contributions: OW and IH developed the algorithm and wrote the 
paper. BP and GL made substantial intellectual contributions to developing the 
algorithm. 

Funding: OW and IH were partially supported by NIH/NHGPJ grant R01- 
GM076705. 

A Additional notation 

This section defines commonplace notation used earlier in the document. 
A.l Sequences and alignments 

Sequence notation: Let f2* be the set of sequences over fi, including the empty 
sequence e. Let x + y denote the concatenation of sequences x and y, and J2 n x n 
the concatenation of sequences {x n }. 

Gapped-pair alignment: A gapped pairwise alignment is a sequence of indi- 
vidual columns of the form (uji,uj 2 ) where oj-y G (f^i U {e}) and u>2 G (^2 U {e}). 

Map from alignment to sequences: The function : ((Cli U {e}) x (f2 2 U {e}))* — > 
Qjj returns the fc'th row of a pairwise alignment, with gaps removed 

Si{x) = ^2 w i 

s 2 (x) = uj 2 

Transition paths: A transition path 7r G LT is a sequence of transitions of the 
form (fa, (jjj, wo, 4>2 ) where <t>i,4>2 G $, wj G (Qj U {e}), and u>o G (^O U {e})- 
Input and output sequences: Define the input and output sequences Si : 

n -> n*j and s ■■ n -> n* 

(0i,wj,wo,02)eir 
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